ABSTRACT

This report provides an overview of the current 
applications of computer technology to construct test items and/or to 
formulate tests according to sound measurement principles. The test 
items may be computer-generated from strategies programmed by test 
constructors, or pre-constructed by item writers and stored in 
computer memory. The tests formulated may be administered 
interactively by the computer or as paper and pencil tests. Studies 
dealing with computer applications in item construction, item 
banking, test design, and test administration (both adaptive and 
nonadaptive) are grouped for review in four sections: (1) theoretical 
and philosophical propositions; (2) applications and implementations; 
(3) evaluation and research; and (4) prospects for the future and 
implications for educational testing. It is concluded that while 
there have been many attempts to utilize computers for test 
construction, actual successful, large scale applications are 
relatively few. Most of these simply use computers to replace pencil 
and paper tests or human labor. With the exception of adaptive 
testing, there is little documentation to show that the quality of 
assessment processes is improved by computer utilization. However, 
with continuing rapid technological developments to overcome current 
computer limitations, and with attention to measurement quality, the 
future of computer-assisted test construction should be very bright. 






COMPUTER-ASSISTED 
TEST CONSTRUCTION: 



A State of the Art 



TME REPORT 88 



by 

Tse-chi Hsu 
University of Pittsburgh 

Shula F. Sadock 
Pittsburgh Board of Education 




EDUCATIONAL TESTING SERVICE 

PRINCETON, NEW JERSEY 08541-0001 



COMPUTER-ASSISTED TEST CONSTRUCTION: 
THE STATE OF THE ART 



by 



Tse-chi Hsu Shula F. Sadock 

University of Pittsburgh Pittsburgh Board of Education 



November, 1985 



ERIC Clearinghouse on Tests, Measurement, and Evaluation 
Educational Testing Service, Princeton, New Jersey 08541-0001 



The material in this publication was prepared pursuant to a 
contract with the Office of Educational Research and 
Improvement, U.S. Department of Education. Contractors 
undertaking such projects under government sponsorship are 
encouraged to express freely their judgment in professional 
and technical matters. Prior to publication, the manuscript 
was submitted to qualified professionals for critical review 
and determination of professional competence. This 
publication has met such standards. Points of view or 
opinions, however, do not necessarily represent the official 
view or opinions of either these reviewers or the Office of 
Educational Research and Improvement. 










This publication was prepared with funding from the Office of 
Educational Research and Improvement, U.S. Department of 
Education under contract No. NIE-400-83-0015. The opinions 
expressed in this report do not necessarily reflect the 
positions or policies of OERI or the Department of Education. 






Table of Contents 

I. INTRODUCTION 

II. THEORETICAL AND PHILOSOPHICAL PROPOSITIONS 
    Item Construction 
    Item Banking 
    Test Design 
    Test Administration 

III. APPLICATIONS AND IMPLEMENTATIONS 
    Item Construction 
    Item Banking 
    Test Design 
    Test Administration 
        Adaptive Testing 
        Nonadaptive Testing 

IV. EVALUATION AND RESEARCH 
    Item Construction 
    Item Banking 
    Test Design 
    Test Administration 

V. PROSPECTS FOR THE FUTURE 
    Item Construction 
    Item Banking 
    Test Design 
    Test Administration 
    Implications for Educational Testing 

APPENDIX A: Sample of a User's Evaluation Form 

REFERENCES 






I. INTRODUCTION 



Educators have attempted to apply computer technology to 
testing since the emergence of computers. The earliest and most 
successful applications are probably in test scoring, test 
reporting, and item analysis (Baker, 1971). Although many 
attempts have been made to apply computers to other aspects of 
testing, the degrees of success vary. In this paper, we will 
not attempt to provide a complete account of computer testing 
history. Rather, we will try to give a summary of the state of 
the art of computer-assisted test construction. We hope that 
the summary will be useful to the developers, researchers, and 
implementers of computer-assisted test construction systems. 

Before we proceed to the main theme of the paper, however, 
we must describe our concept of computer-assisted test 
construction, because the term has been used to represent 
different activities by different people. Our concept of 
computer-assisted test construction includes any activity which 
involves the utilization of computer technology to construct 
items and/or to select a set of pre-constructed items to form a 
test. This concept emphasizes the application of computers to 
assist in the selection of items based on sound measurement 
principles. The items may be pre-constructed by item writers 
and stored in the memory of computers or generated by the 
computer from strategies programmed by test constructors. The 
test formulated through this process may be administered to 
pupils by the computer interactively or printed on paper and 
administered as a paper and pencil test. Using this concept, 
we will review only studies dealing with item construction, item 
banking, test design, and test administration, either adaptive 
or nonadaptive. 

Item construction concerns the utilization of computers in 
constructing or generating test items. Item banking deals with 
the systematic storage and subsequent retrieval and/or 
modification of previously constructed items. Consequently, our 
emphasis here is on item banking systems. If only item 
attributes, such as identification numbers and statistics, 
instead of items per se, are stored in the bank for the purpose 
of selecting items for a test, the process is considered in the 
category of test design. Item banks that involve no item 
classification and/or item selection strategies will be 
excluded. Test administration includes applications utilizing 
computers to identify items from a larger item pool and 
administering the items to the students. The emphasis is on 
whether the computer is utilized to improve the quality of test 
administration. Using this criterion, we may include the 
majority of adaptive testing systems. Computer-assisted 
nonadaptive testing systems are included only if they appear to 
offer some advantages over the traditional paper and pencil, 
group administration approach. Thus, the administration of 
standardized tests on computers will be included. Tests 
administered as part of computer-assisted instruction (CAI) 
lessons will not be discussed because they cannot be considered 
independently from the CAI strategies, which are not the primary 
concerns of this article. 



In addition to classifying studies according to item 
construction, item banking, test design, and test administration 
categories, we have further grouped them into four sections in 
the review. Studies dealing with theoretical and/or 
philosophical propositions will be grouped into Section II. 
Some of the ideas described in this section may have been 
incorporated into practices already. Others may still be in the 
stage of experimental trials. The objective of this section is 
to present the researchers' concepts of how computers should be 
utilized. 

Section III contains applications implemented on both 
mainframes and microcomputers. Many of the applications 
appearing in the 1970s were designed for mainframes, while the 
1980s are characterized by applications designed for 
microcomputers. Since the applications before microcomputers 
have a long history, a great deal of literature can be included 
in this section. Emphasis will be placed on applications 
implemented after approximately 1973. Studies published before 
that will be included only if they have implications for later 
developments. Readers interested in earlier developments may 
wish to consult Lippey (1974), Byrne (1976), and the 1973 
special issue of Educational Technology on computer-assisted 
test construction. Most of the applications which emerged 
during the last few years involve microcomputers. Since 
microcomputers are so popular these days, it is important to 
have a good assessment of the status of these applications. 

Section IV consists of evaluation and research issues 
associated with test construction applications. These issues 
may be related to either mainframes or microcomputers. Studies 
dealing with empirical investigations of theoretical issues or 
with evaluations of various applications may be included also. 

Prospects for the future will be discussed in Section V. 
It will include a survey of prospects offered by researchers and 
our observations. Implications for the future of educational 
testing will be described also. 



II. THEORETICAL AND PHILOSOPHICAL PROPOSITIONS 



Ideas about how computers should be used in test 
construction are the seeds for innovations. In this section, we 
are going to summarize some of the ideas appearing in recent 
literature according to the four categories described previously. 
Readers seeking additional ideas may also consult Baker (in 
preparation); Hambleton (1984); Hambleton, Anderson, and Murray 
(1983); Oosterhof & Salisbury (1985); Roid (1984b); and Sampson 
(1983). 

Item Construction 

Using computers to construct items is not a new concept. 
Anastasio and his associates attempted to use computers to 
construct sentence completion and spelling items in the late 
1960s (Anastasio, Marcotte, & Fremer, 1969; Fremer & Anastasio, 
1969). These works, however, were never really adopted by test 
constructors. Several researchers attempted to generate items 
in the early 1970s (Ferguson & Hsu, 1971; Feldker, 1973; 
Vickers, 1973), but their strategies of item generation 
differ. One of the most common approaches is to generate 
items based on item forms which represent a specific domain of 
content. 

A prerequisite for using computers to construct items is to 
develop algorithms for item construction. These algorithms must 
be based on various item writing techniques. Although the 
interest in item writing techniques is not new, recent interest 
in this topic focuses on how these techniques may be 
computerized. For example, Hsu (1975) discussed four 
achievement test construction approaches: Guttman's facet 
design, Hively's item form analysis, Scandura's algorithmic 
analysis, and Bormuth's operational approach (or linguistic 
transformation). These and other techniques are illustrated and 
discussed in detail by Millman (1974), Roid and Haladyna (1982), 
and Roid (1984a). Attempts to computerize some of these 
techniques have been made. For example, item form analysis has 
been implemented in Ferguson and Hsu (1971), Hsu and Carlson 
(1973), and Millman and Outlaw (1978). The facet design has 
been tried by Berk (1978). The linguistic transformation 
approach has been utilized by Finn (1975) and Roid and Finn 
(1978). Some of these applications will be described in more 
detail later. 

Item Banking 

An item bank is a collection of items that has been 
organized and classified in terms of the content and/or the 
statistical characteristics of the items. Most banked items are 
objective items such as multiple-choice and true-false. In this 
section we are not concerned with describing existing item 
banks. Rather, our emphasis is on computerized item banking 
systems. The purpose of an item banking system is to catalogue, 
modify and maintain a set of items. Before developing an item 
bank, one may want to make sure that an item bank is really 
needed. Millman and Arter (1984) pointed out that for an item 
bank to be valuable, at least one of the following conditions 
must be met: 

1. Tests constructed according to local specifications are 
needed and not yet available; 

2. Frequent testing is required; 

3. Multiple forms of a test are needed; 

4. Individually tailored tests are desired; 

5. Multiple users and/or contributors are willing to 
cooperate; and/or 

6. An item banking system is available. 

An item bank that is useful for test construction, however, 
is not easy to design (Hiscox, 1984a). Several factors must be 
considered. First of all, items in the bank must be classified 
meaningfully and systematically. Item classification systems 
should not be exclusively governed by concerns for quick and 
efficient retrieval of items. Rather, proper classification 
should aid in improving the validity of the test to be 
constructed. The item classification system should be dependent 
upon the purpose of the item bank. We cannot expect one 
classification scheme to be used for all purposes. 

The second criterion which should be considered is whether 
the bank is easy to maintain. This may include procedures for 
creating, storing, retrieving, and modifying items. Some word 
processing capability is desirable. But this capability should 
not occupy space needed for manipulating the item bank. 

Item evaluative data are needed for selecting good quality 
items. A good bank should be used to maintain and to update 
these data for the users. Since there are many varieties of 
item data, the bank should be flexible in terms of the kinds of 
statistics needed by different users. It will be most desirable 
if the users have the option to choose the kinds of item 
statistics to meet their needs. 

Another factor which should be considered is whether the 
procedure for assembling items is adequate. A commonly used 
procedure is the selection of items one at a time by the users. 
One advantage of this approach is that the user has a chance to 
evaluate each item carefully and then decide whether the item 
should be used. This approach, however, is very inefficient 
especially when the bank is relatively large. If possible, 
items meeting criteria specified by the users should be selected 
first before examining items one at a time. Random selection of 
items without users' review is not desirable. 
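
The screening step suggested above (narrow the bank by user-specified criteria first, then review the survivors one at a time) can be sketched in Python. This is only an illustration and does not describe any system reviewed here; the `Item` fields, the `screen_items` function, and the sample items are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One banked item; every field name is an illustrative assumption."""
    item_id: str
    content_code: str      # classification by content
    difficulty: float      # proportion answering correctly in past use
    discrimination: float  # item-total correlation in past use
    text: str

def screen_items(bank, content_code, min_p, max_p, min_disc):
    """Return items meeting the user's criteria, for later one-by-one review."""
    return [
        item for item in bank
        if item.content_code == content_code
        and min_p <= item.difficulty <= max_p
        and item.discrimination >= min_disc
    ]

# A tiny made-up bank.
bank = [
    Item("A1", "fractions", 0.62, 0.41, "1/2 + 1/4 = ?"),
    Item("A2", "fractions", 0.95, 0.10, "1/2 + 1/2 = ?"),
    Item("B1", "decimals", 0.55, 0.38, "0.5 + 0.25 = ?"),
]
candidates = screen_items(bank, "fractions", min_p=0.3, max_p=0.9, min_disc=0.2)
for item in candidates:
    print(item.item_id, item.text)  # the user still judges each candidate
```

The point of the design is that the computer only shortens the list; the decision to use an item remains with the reviewer.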

The final criterion which should be considered is the 
flexibility of the item bank usage. In assembling the tests to 
be printed, who will decide the order of the items to be 
printed? Is it possible to use the same bank for on-line 
testing? Is it possible to print items directly on stencils or 
dittos so that many copies of the tests can be made easily? 

Although the technical quality of programming is desirable, 
the technical quality of testing should not be sacrificed. Item 
banking is not simply a means of storing items, but should 
assist users in selecting high quality items for a specific 
purpose. Therefore, a one- or two-second delay in retrieving an 
item is probably not as important as the question of whether 
this is the most desirable item for this purpose. Does the bank 
incorporate enough measurement principles so that it can provide 
sufficient clues to the users about the quality of the item? 
This should be the primary consideration in designing an item 
bank. 

In addition to the criteria mentioned above, other issues 
regarding item banking can be found in Estes and Arter (1984) 
and Millman and Arter (1984). Besides describing the advantages 
and the disadvantages of an item bank, the last reference also 
provides an extensive list of questions to be answered when 
designing an item bank and determining the type of item 
information that may be stored in the bank. Readers interested 
in designing an item bank based on the Rasch model may consult 
Wright and Bell (1984). 

Test Design 

The category of test design considers the test as a whole. 
The primary concern of a test, of course, is its quality. Two 
major indicators of quality are validity and reliability. How 
computers may be used to design a test and judge its quality is 
the central theme of this section. In order to design a test 
and judge its quality, item and test statistics should be 
computed and evaluated. Since using computers to generate 
item/test statistics is not included here, we focus only on the 
evaluation of item/test statistics for the purpose of item 
selection. 

In planning for a classroom test, Nitko (1983) suggested 
the following major steps: 

1. Define the purpose for testing at this time. 

2. Specify the performance and processes to be observed 
and tested. 

3. Select the type of test items or the methods to be used 
to observe and to test the performance. 

4. Develop the initial drafts of the test exercises. 

5. Are the items of satisfactory quality? If not, revise 
or reconstruct the items. 

6. Do the items match the stated performance to be 
assessed? If not, revise or reconstruct the items. 

7. Conduct a preliminary tryout of items, if possible. 

8. Do the items appear to be functioning as intended? If 
not, revise or reconstruct the items. 

9. Develop the final version of the test. 

10. Administer the test and analyze the results. 

11. Does the test appear to be functioning well? If not, 
revise or reconstruct the items. 

12. Use the test for decision-making. 

We cannot expect the computer to assist us in all 12 
steps. But an innovative researcher may be able to find some 
ways to utilize the computer for certain functions in each 
step. One possible exception is probably Step 1, defining the 
purpose of testing. Steps that are most relevant to test design 
are 3, 4, 6, 7, and 9. Step 5 is also a part of test design. 
In this article, however, it is classified in the category of 
item banking, which maintains actual items. 

At Steps 3 and 4, computers may be utilized to assemble 
items according to content and/or item types. At Steps 6, 7, 
and 9, the computer may be used to store, compute, and display 
the item/test statistics needed for judging the quality of 
items/test. For example, based on item difficulty and 
discrimination indices obtained from previous testings, an 
estimate of the reliability coefficient of the new test can be 
made. This procedure has been implemented in the Pittsburgh 
Educational Testing Aids system (Nitko & Hsu, 1984a). 
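
The report does not say which formula the Pittsburgh system uses. As one classical possibility only, the sketch below estimates KR-20 for a planned test from banked item statistics, using the identity that the standard deviation of total scores equals the sum, over items, of the item-total correlation times the item standard deviation. The function name and the sample values are assumptions, and with statistics carried over from earlier testings the result is an approximation.

```python
import math

def estimate_kr20(difficulties, discriminations):
    """Estimate KR-20 from item p-values and item-total correlations.

    difficulties:    proportion correct for each item (p)
    discriminations: item-total (point-biserial) correlation for each item
    """
    k = len(difficulties)
    if k < 2 or k != len(discriminations):
        raise ValueError("need matching statistics for at least two items")
    item_sds = [math.sqrt(p * (1.0 - p)) for p in difficulties]
    sum_item_var = sum(sd * sd for sd in item_sds)
    # Test SD approximated as the sum of (correlation * item SD).
    test_sd = sum(r * sd for r, sd in zip(discriminations, item_sds))
    return (k / (k - 1)) * (1.0 - sum_item_var / test_sd ** 2)

# Ten hypothetical items, each with p = 0.5 and item-total r = 0.4.
print(round(estimate_kr20([0.5] * 10, [0.4] * 10), 3))
```

Because the estimate depends only on stored statistics, it can be recomputed each time an item is added to or dropped from the draft test.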

If estimators of item parameters based on item response 
theory (IRT) are available, the potential of estimating the 
quality of a new test is even greater. Since IRT offers a means 
by which item and test characteristics can be independent of the 
performance of some tryout group, "it becomes possible to 
describe in precise terms the characteristics of the test before 
the test is administered. This capacity allows one to construct 
a test that is highly efficient in accomplishing the purpose of 
the test" (Warm, 1978, p. 17). 

Many different IRT models have been proposed for designing 
tests including the one-, two-, and three-parameter models. 
Comparative studies (e.g., Koch & Reckase, 1978, 1979; Urry, 
1970, 1977) have investigated the utility of employing the one- 
and three-parameter (1PL and 3PL) models. Although the 3PL 
model yields higher reliability, it is prone to nonconvergence 
of ability estimates. Nonconvergence is rarely a problem with 
the 1PL model, and McKinley and Reckase (1984) recommend this 
model when small item pools are used. 
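
For readers unfamiliar with these models, the logistic item response function can be written as a short Python function. The parameter names follow common IRT usage (a for discrimination, b for difficulty, c for the lower asymptote or "guessing" level); the scaling constant D = 1.7 is a frequent convention and is treated here as an assumption. The one- and two-parameter models are the special cases noted in the comments.

```python
import math

def prob_correct(theta, b, a=1.0, c=0.0, D=1.7):
    """Probability of a correct response under the logistic IRT model.

    3PL: a, b, and c all vary by item.
    2PL: c is fixed at 0.
    1PL: c is fixed at 0 and a is the same for every item.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# An examinee whose ability equals the item difficulty:
print(prob_correct(theta=0.0, b=0.0))          # halfway between c and 1
print(prob_correct(theta=0.0, b=0.0, c=0.2))   # guessing raises the floor
```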

Numerous calibration procedures exist for obtaining item 
parameters and ability estimates (e.g., BICAL by Wright, Mead, & 
Bell, 1980; BILOG by Bock & Aitkin, 1981; and LOGIST by 
Wingersky, Barton, & Lord, 1982). The mathematical complexity 
of the models, however, necessitates the use of a mainframe 
computer to obtain these estimates. Many attempts have been 
made to computerize the test design applications of IRT. While 
some systems have incorporated the parameter estimation 
procedures, other approaches rely on precalibrated item data 
(e.g., Holmes, 1983; Sadock, 1984). Before presenting some 
computerized applications of IRT, we will examine the 
theoretical basis of some of these test design systems. 
Following the test development applications of IRT, Lord 
suggests the following approach for designing a mastery test: 

1. Obtain a pool of items for measuring the skill of 
interest. 

2. Calibrate the items on a convenient sample. 

3. Considering the entire item pool as a single test, 
calculate the test characteristic curve. 

4. Define mastery in terms of true score. 

5. Find the θ-cutoff equivalent using the mastery score 
and the test characteristic curve. 

6. Evaluate the item information at the θ-cutoff 
equivalent. 

7. Decide what length confidence interval for θ will be 
adequate at the cutoff equivalent. Using this 
information, determine the required test information 
at the cutoff. 

8. Select items with the most information at the cutoff 
and continue selecting until the sum of the item 
information at the cutoff equals the required test 
information. 

9. Compute scoring weights for each item selected. 

10. Compute the weighted sum of item scores for each 
examinee. 

11. Compute the cutoff score. 

12. Administer the test and select examinees with scores 
greater than the cutoff score. (Lord, 1980, pp. 
174-175) 
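
Steps 3 through 8 of the procedure above lend themselves to a compact sketch. The Python below is an illustration under stated assumptions, not Lord's or any reviewed system's implementation: items are (a, b, c) tuples under the 3PL model with D = 1.7, mastery is expressed as a proportion-correct true score on the whole pool, the cutoff is found by bisection, and the confidence interval is specified by its half-width. All function names are hypothetical. Steps 9 through 12 (scoring weights and the cutoff score) are not shown.

```python
import math

D = 1.7  # logistic scaling constant (assumed convention)

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_info(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def theta_cutoff(pool, mastery_proportion, lo=-4.0, hi=4.0):
    """Steps 3-5: invert the test characteristic curve by bisection."""
    def tcc(theta):
        return sum(p3pl(theta, *item) for item in pool) / len(pool)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if tcc(mid) < mastery_proportion:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def select_items(pool, theta_c, ci_half_width, z=1.96):
    """Steps 6-8: take the most informative items until enough information.

    Required information follows from SE = 1 / sqrt(information).
    If the pool runs out first, everything is returned and total < required.
    """
    required = (z / ci_half_width) ** 2
    ranked = sorted(pool, key=lambda it: item_info(theta_c, *it), reverse=True)
    chosen, total = [], 0.0
    for item in ranked:
        if total >= required:
            break
        chosen.append(item)
        total += item_info(theta_c, *item)
    return chosen, total, required

# Made-up pool of three items, each given as (a, b, c).
pool = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_c = theta_cutoff(pool, mastery_proportion=0.5)
chosen, total, required = select_items(pool, theta_c, ci_half_width=2.0)
print(round(theta_c, 3), len(chosen), total >= required)
```

A real pool would be far larger, and a designer would normally add content constraints on top of the purely statistical ranking used here.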

In computerizing the above procedures, several 
modifications have been made including provisions for specifying 
several cutoff scores (i.e., the design of classification tests) 
and the selection of items based on a maximum information range 
for each item. For example, the IRT-based test design system 
developed by Sadock (1984) allows users to specify up to four 
cutoff scores. In addition, under certain conditions, items are 
selected if their point of maximum information falls within a 
fixed range on the ability scale. A more precise indicator of 
maximum item information has been proposed by Reckase & McKinley 
(1984). They suggest selecting items based on a variable length 
item effectiveness range. In addition, a new item difficulty 
parameter is defined as the midpoint of the effectiveness range. 
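
The fixed-range selection rule can be illustrated with a small Python sketch. It is a reading of the rule under the 3PL model, not Sadock's (1984) program: the point of maximum information uses the standard closed form for the 3PL model, while the width of the range (`half_range`), the function names, and the sample items are assumptions. The variable-length effectiveness range of Reckase and McKinley is not implemented.

```python
import math

D = 1.7  # logistic scaling constant (assumed convention)

def theta_max_info(a, b, c):
    """Ability at which a 3PL item gives its maximum information.

    With c = 0 this reduces to the item difficulty b.
    """
    return b + math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)

def items_near_cutoffs(pool, cutoffs, half_range=0.5):
    """Keep items whose maximum-information point is near any cutoff."""
    return [
        item for item in pool
        if any(abs(theta_max_info(*item) - cut) <= half_range for cut in cutoffs)
    ]

# Made-up items (a, b, c) and two hypothetical cutoff scores.
pool = [(1.0, -2.0, 0.0), (1.0, 0.2, 0.0), (1.0, 1.0, 0.0)]
print(items_near_cutoffs(pool, cutoffs=[0.0, 1.0], half_range=0.25))
```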

Theoretical investigations by Samejima (1977) and Thissen 
(1976) have resulted in several new IRT models which incorporate 
information available from incorrect responses in estimating 
ability levels. Using response characteristic curves and 
response information curves, Woods (1983) describes a 
computer-aided item development procedure based on Samejima's 
models. IRT applications for both item construction and test 
design are used in this procedure. 

The theoretical IRT literature is quite extensive. Over 
the past five years, much of the IRT literature has been 
centered around test development applications. The mathematical 
complexities of the IRT models, in many cases, require the use 
of a computer in applying the models to practical testing 
situations. Although many packages exist for calibrating items 
on mainframe computers, few calibration procedures exist for 
microcomputers. Baker (in preparation) has identified one 
microcomputer calibration system. MICROSCALE (Mediax, 1984) is 
a microcomputer version of the BICAL program which is based on 
the Rasch model. 

The memory size and speed of micros significantly 
contribute to the current lack of software. However, because of 
the increasing use of micros in designing tests, as well as 
increased availability of add-on memory, we see a need for 
procedures for approximating item and test parameters which may 
be implemented on microcomputers. Microcomputer-assisted test 
design systems which currently employ IRT principles include 
calibrated item data in the item bank, thereby eliminating the 
need to calibrate the items with the micro. However, this 
approach requires the use of both mainframe and micro in 
developing such a system. 

Test Administration 

Administration of tests by the computer is justifiable only 
if it can improve the quality of testing. For example, the 
quality of testing can be improved by using the computer: (a) to 
provide immediate feedback, (b) to select the next item based on 
the response, (c) to store and analyze test results, and (d) to 
increase test security. But there are also many difficulties in 
administering tests using the computer such as: (a) the need for 
one computer (or terminal) for each student, (b) difficulty in 
tracking omitted items for review once all items have been 
attempted, (c) limited space in one screen, (d) limited memory 
storage, (e) slow speed printers for printing test results, and 
(f) difficulty in overcoming "computer phobia" by some 
examinees. 

Two issues related to test administration will be discussed 
in this section. The first issue concerns adaptive testing, 
which became feasible because of the availability of the 
computer. The second issue addresses the impact of technology 
on test administration. 

The major advantage of adaptive testing is the reduction of 
test administration time without the sacrifice of measurement 
precision (W. C. Ward, 1984). Adaptive testing procedures 
consist of three components: an item selection routine, an 
ability estimation technique, and a stopping rule. 

Stocking and Swanson (1979) outline the typical adaptive 
testing algorithm as follows: 

1. Obtain an initial estimate of the examinee's ability 
level. 

2. Use this estimate to select an appropriate item from 
the item pool. 

3. Administer and score the item. Use this information to 
revise the estimate of trait level. 

4. If the estimate is satisfactory, stop. Otherwise, 
further refine the estimate by returning to Step 2. 
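
The four steps above map directly onto a loop. The Python sketch below is one possible instance, chosen for brevity rather than taken from Stocking and Swanson: it assumes the Rasch model, a standard normal prior, a posterior-mean ability estimate computed on a grid, and selection of the unused item whose difficulty is closest to the current estimate (which is the most informative item under the Rasch model). The `answer` callback, which returns True for a correct response to an item of a given difficulty, and all other names are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch (one-parameter logistic) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid from -4 to 4

def posterior_mean_sd(responses):
    """Posterior mean and SD of ability under a standard normal prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def adaptive_test(difficulties, answer, max_items=10, target_sd=0.4):
    remaining = list(difficulties)
    responses = []
    theta, sd = 0.0, 1.0                             # Step 1: initial estimate
    while remaining and len(responses) < max_items and sd > target_sd:
        b = min(remaining, key=lambda d: abs(d - theta))  # Step 2: select
        remaining.remove(b)
        responses.append((b, answer(b)))             # Step 3: administer, score
        theta, sd = posterior_mean_sd(responses)     # Step 3: revise estimate
    return theta, sd, responses                      # Step 4: loop has stopped

# A made-up examinee who answers correctly whenever difficulty is below 1.0.
pool = [d / 2.0 for d in range(-6, 7)]               # difficulties -3.0 to 3.0
theta, sd, responses = adaptive_test(pool, answer=lambda b: b < 1.0)
print(round(theta, 2), round(sd, 2), [b for b, _ in responses])
```

Note how a correct response pulls the estimate up, so the next item chosen is harder, which matches the typical pattern described in the text.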

Several procedures exist for selecting appropriate items 
(e.g., Samejima, 1977; Lord, 1971; Wald, 1947). Typically, 
an incorrect response is followed by the administration of an 
easier item, and a correct response is followed by a more 
difficult item. The precise characteristics of the item (i.e., 
item difficulty and information) are dependent upon the 
selection algorithm. For example, selection rules may include, 
but are not limited to, the following procedures: 

(a) For Bayesian updating, items with the highest 
discriminating power are selected since these items reduce the 
posterior variance. 

(b) Items with maximal information are selected when 
maximum likelihood ability estimation is used. (Green, Bock, 
Humphreys, Linn, & Reckase, 1982) 

Ability estimation procedures are numerous including the 
confidence interval approach, point estimation based on 
regression, maximum likelihood estimation, and Bayesian 
estimation approach (Weiss, 1974). 

Although many stopping rules have been used, the three most 
frequently implemented rules include the following: 

1. Stop when a fixed number of items has been 
administered, 

2. Stop when all items with maximum information at the 
current ability estimate have been administered, and 

3. Stop when a stable ability estimate has been obtained. 

Because of complicated ability estimation procedures used in 
adaptive testing, most computer applications of adaptive testing 
have been implemented on mainframe computers. 
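
The three stopping rules listed above can each be expressed as a small predicate. The Python below is a sketch under assumptions: the tolerance used to decide that an item's information peaks "at" the current estimate, and the window and threshold used to call an estimate "stable", are illustrative choices that the sources cited here do not specify.

```python
def stop_fixed_length(n_administered, test_length):
    """Rule 1: a fixed number of items has been administered."""
    return n_administered >= test_length

def stop_pool_exhausted(item_peaks, used, theta_hat, tolerance=0.25):
    """Rule 2: no unused item has its information peak near the estimate.

    item_peaks: ability value at which each item is most informative
    used:       set of indices of items already administered
    """
    return not any(
        abs(peak - theta_hat) <= tolerance
        for index, peak in enumerate(item_peaks)
        if index not in used
    )

def stop_stable(estimates, window=3, epsilon=0.05):
    """Rule 3: the last `window` estimates vary by no more than epsilon."""
    if len(estimates) < window:
        return False
    recent = estimates[-window:]
    return max(recent) - min(recent) <= epsilon

print(stop_fixed_length(10, 10))
print(stop_pool_exhausted([0.0, 0.1, 2.0], used={0, 1}, theta_hat=0.05))
print(stop_stable([0.1, 0.5, 0.52, 0.53]))
```

In practice a system might combine rules, for example stopping at a stable estimate but never exceeding a fixed maximum length.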

Implementation of adaptive testing on microcomputers should 
be possible with new developments in hardware and software. 
Baker (1984) pointed out several technological trends which 
should have implications for test administration: (a) 32-bit 
internal registers and 16-bit address and data buses, (b) new 
optical storage devices, (c) video disks, (d) devices to scan 
graphic material and software to create graphic material, (e) 
voice input and output devices, (f) computer networking, and (g) 
new software development. These trends may not only enhance the 
capability of test administration; they may also help to 
revolutionize the types of test items. 

In designing both adaptive and nonadaptive tests to be 
administered by computers, several guidelines are available 
(Brightman, Freeman & Lewis, 1984; Mizokawa & Hamlin, 1984; 
Wedman & Stefanich, 1984). These guidelines address how 
instructional strategies, psychometric theory, and technology 
may work together to administer tests effectively. In general, 
technology should be used to facilitate rather than to handicap 
students' responding processes. Therefore, the format, the 
rate, and task demands of item presentation should be designed 
in such a way that errors in using hardware to respond can be 
minimized. 



III. APPLICATIONS AND IMPLEMENTATIONS 



Item Construction 

Many computer-assisted test construction packages which do 
not include actual items are equipped with item generators. 
There are two basic types of item generators: (a) those that 
store parts of an item (e.g., the stem and the options) and 
include rules for combining these parts to construct a whole 
item and (b) item generators based on item forms which consist 
of rules for generating items. 

This first type of item generator was used in the Question 
Pool Management System (QPMS) (Denny, 1973). This system 
requires users to break up each item into three components: the 
item stem, seven possible correct answers, and seven possible 
distractors. For any item selected for inclusion in a test, the 
generator randomly selects one correct option and four 
distractors. 
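
A generator of this first type is easy to sketch. The Python below follows the description above (one of seven correct answers plus four of seven distractors) but is not the QPMS code; the function name, the returned dictionary, the shuffling of option order, and the sample content are assumptions made for illustration.

```python
import random

def assemble_item(stem, correct_answers, distractors, rng=random):
    """Build one five-option item from stored parts."""
    key = rng.choice(correct_answers)            # one correct option
    options = rng.sample(distractors, 4) + [key]  # four distinct distractors
    rng.shuffle(options)                          # option order: an assumption
    return {"stem": stem, "options": options, "key": key}

# Made-up stored components for a single item.
stem = "Which of the following is an even number?"
correct_answers = ["2", "4", "6", "8", "10", "12", "14"]
distractors = ["1", "3", "5", "7", "9", "11", "13"]

item = assemble_item(stem, correct_answers, distractors, rng=random.Random(0))
print(item["stem"])
for label, option in zip("ABCDE", item["options"]):
    print(f"  {label}. {option}")
```

Passing a seeded `random.Random` instance makes a generated form reproducible, which is useful when a printed test must be regenerated later.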

The more frequently used computerized method of 
constructing items relies on item generators which construct 
items from item forms (Hively, Patterson, & Page, 1968). The 
Individually Prescribed Instruction (IPI) mathematics programs 
(Hsu & Carlson, 1973) used this approach. Each unit 
comprised several objectives, and each objective was divided 
into several item forms. 
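
An item form can be thought of as a fixed shell plus rules for filling its variable parts. The Python sketch below shows one hypothetical item form (adding two 2-digit numbers with no carrying); it is not drawn from the IPI materials, and the function name and the returned fields are assumptions.

```python
import random

def addition_item_form(rng):
    """Item form: add two 2-digit numbers where no column requires carrying."""
    tens_a, tens_b = rng.randint(1, 4), rng.randint(1, 4)  # tens sum <= 8
    ones_a = rng.randint(0, 9)
    ones_b = rng.randint(0, 9 - ones_a)                    # ones sum <= 9
    a = 10 * tens_a + ones_a
    b = 10 * tens_b + ones_b
    return {"stem": f"{a} + {b} = ____", "operands": (a, b), "answer": a + b}

rng = random.Random(1985)
for _ in range(3):
    generated = addition_item_form(rng)
    print(generated["stem"], "  key:", generated["answer"])
```

Each call yields a different item from the same domain, which is what allows parallel forms to be produced from a single item form.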

Millman and Outlaw (1978) have developed a system which 
enables users to construct test items using item programs. 
These item programs are another type of item generator. The 
item programs must be written by the user using an expanded 
version of BASIC. Using a sample item as a guide, variations of 
the item are then produced by breaking up the item into logical 
segments. For each segment, sets of alternate words are 
constructed. For instance, if an item is to test knowledge of 
characteristics of different plants, the item program may be 
written to select among five different plants and three 
different characteristics. With the item program, the user can 
specify whether a segment is to be randomly generated from among 
the alternatives or whether the generation of one segment is dependent 
upon the variations previously selected. Possible answers are 
also selected by the item program. Answer choices can also be 
selected on a random or conditional basis. This system has been 
discussed and illustrated in Millman (1980, 1982). 
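
The idea of an item program can be sketched as follows. The
original item programs were written in an expanded BASIC; Python
is used here only for illustration, and the plants,
characteristics, and item wording are hypothetical content, not
taken from Millman and Outlaw.

```python
import random

# Hypothetical content: five plants and the characteristics each one has.
TRAITS = {
    "fern":   {"has vascular tissue"},
    "moss":   set(),
    "pine":   {"has vascular tissue", "produces seeds"},
    "oak":    {"has vascular tissue", "produces seeds", "produces flowers"},
    "cactus": {"has vascular tissue", "produces seeds", "produces flowers"},
}
CHARACTERISTICS = ["has vascular tissue", "produces seeds", "produces flowers"]

def run_item_program(rng=None):
    """Generate one true-false variation of the sample item."""
    rng = rng or random.Random()
    plant = rng.choice(sorted(TRAITS))            # randomly generated segment
    characteristic = rng.choice(CHARACTERISTICS)  # randomly generated segment
    # Conditional segment: the article depends on the plant already chosen.
    article = "An" if plant[0] in "aeiou" else "A"
    stem = f"True or false: {article} {plant} {characteristic}."
    # The keyed answer is conditional on both segments selected above.
    answer = characteristic in TRAITS[plant]
    return stem, answer
```

Five plants crossed with three characteristics give 15 possible
variations of the one sample item.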

Instead of merely using computers to perform permutations 
and combinations of content preconstructed by the item writer, 
Millman and Westman (J. Millman, personal communication, 1985) 
are designing an improved system which incorporates some 
artificial intelligence capabilities. In using this system, item
writers can be assisted by the system interactively in various
ways: (a) examining prototype items measuring the desired 
processes, (b) offering prompts based on users' needs, and (c) 
accessing a number of available system libraries. 

So far, we have not found any application of microcomputers
to item construction in the literature. This is probably due to
the limited storage space of the first generation of
microcomputers. With the development of the second generation
of microcomputers and the progress in artificial intelligence, we
may anticipate more applications in this area (Roid, 1984a).
Although several articles discuss the possibility of automating 
item construction processes (e.g., Millman, 1980), more 
theoretical work in item writing techniques is still needed. 

Item Banking 

Item banking systems tend to fall into one of two
categories: those equipped with and designed for a specific set
of items and those that require users to enter and store their
own items. To be included in this section, an item banking
system must meet the requirements of flexibility stated earlier.

MEDSIRCH, used by medical schools in Canada (Hazlett, 
1973), is an item banking system which requires users to create 
and store their own items. MEDSIRCH allows a maximum of 57 
variables or codes to describe each item. These include such 
things as area of specialty and subspecialty, degree of 
importance, difficulty level, history of use, and even type of 
audiovisual equipment necessary. The user creates the bank by 
preparing items on keypunch cards and submitting the cards to a 
series of programs which check, catalogue, and eventually store 
the items on tape. A similar system developed at Iowa State 
University (Menne, 1973) was implemented because many 
instructors were filling out their own items on index cards or 
IBM cards and using the text editor of the university's computer 
to prepare tests from the item cards. With this system users
have the option of establishing their own item banks or using
one of six existing banks on the system. 

TICAT (Tuskegee Institute Computer-Assisted Tester), an
interactive computer-assisted testing system, can be used to
develop item banks, create and administer tests, and score and
report test results (Howze, 1978). The system was written in
Time-Share BASIC for a Hewlett-Packard 2000-ACCESS system. The 
item bank component can be used to develop item files containing 
a maximum of 128 true-false or multiple-choice items. An item 
file is the actual set of items for an examination. It is not 
the set of items from which the exam items will be selected. 
When items are entered in the file, the system prompts users for 
the item text, the correct answer, and a citation or textbook 
reference. Storage space is reserved for additional information 
regarding item usage. Item information may also be edited and
updated. A COPY routine can be used to combine all or portions
of item files together to form new files. Since item files are 
actually exams, this feature allows considerable flexibility in 
combining several sets of items covering different content 
areas. A LIST command, which allows users to print all or part
of an item file, is also included. This command, however, does
not produce the final printed copy of the test. This function
is performed by a separate component of the system.

Both a manual and a computerized item bank can be used with
CIBEDS, the Computerized Item Banking and Exam Development
System (Vale, 1979). The manual card file was developed as a
first step in designing the bank. The bank consists of
approximately 15,000 items used by the Minnesota State 
Department of Personnel for personnel classification tests. 
Items were classified using a nine-digit code based on the Dewey 
Decimal system. The computerized bank (written in FORTRAN for 
the Control Data Cyber 76 time-sharing system) is capable of 
storing items and item statistics, modifying items, selecting 
items based on item content and/or item statistics, and 
formatting items in a photocopy-ready form. 

The computer-assisted test construction system developed by 
Stock, Esterson, and Schmid (1977) has integrated both item
banking and test design capabilities. In creating the item 
bank, the user is required to define a two-way test 
specification table. Items are then referenced to this test 
blueprint. The system is capable of (a) adding items to the 
bank via batch processing, (b) generating tests by obtaining a 
stratified random sample of items from the user-selected cells of
the test plan, and (c) editing the item bank. The item bank
contains item analysis data including item difficulty and
discrimination, the keyed response, and a reference to the table 
of specifications. Items are selected in an interactive mode, 
whereby users specify cells in the specifications table and the 
program responds by indicating the number of items in the bank 
classified for each selected cell. The system is also capable 
of generating parallel forms of the test. The system was 
designed on a Univac 1110 Exec 8 system and includes an item 
bank containing over 900 measurement and statistics items 
classified by 63 content areas and three skill levels (e.g.,
facts, principles, and applications).
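
Test generation by stratified random sampling from the cells of
a two-way specification table can be sketched as below. This is
a minimal illustration, not the Stock, Esterson, and Schmid
program; the field names and the form of the request are
assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(bank, request, rng=None):
    """Draw items cell by cell from a two-way specification table.

    bank:    list of dicts with hypothetical keys 'id', 'content', 'skill'
    request: {(content, skill): number_of_items} for the selected cells
    """
    rng = rng or random.Random()
    cells = defaultdict(list)
    for item in bank:
        cells[(item["content"], item["skill"])].append(item)

    test = []
    for cell, n_wanted in request.items():
        available = cells.get(cell, [])
        if n_wanted > len(available):
            # Mirrors the interactive step in which the user is told how
            # many items the bank holds for each selected cell.
            raise ValueError(f"cell {cell} holds only {len(available)} items")
        test.extend(rng.sample(available, n_wanted))
    return test
```

Calling the function twice with the same request draws two
independent samples from the same cells, which is one simple way
to approximate parallel forms.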

The science question bank developed for the Assessment of 
Performance Unit in the Department of Education and Science,
England, is a comprehensive system designed specifically for
monitoring science performance (Johnson & Maher, 1982). The
information retrieval system for this bank was implemented on
the AMDAHL V/7 computer at the University of Leeds using the 
CODIL programming language. The system consists of (a) a 
BREAKDOWN procedure, which can summarize items according to 
major- and sub-categories, (b) a BROWSING procedure, which 
allows a user to specify the characteristics of items to be 
reviewed, and (c) a TEST CONSTRUCTION procedure, which assists a 
user to select items to form a test. A random question may be 
rejected if it violates any of the conditions (range, frequency, 
inclusive/exclusive) prescribed by the user. The browsing 
capability of the system has been enhanced greatly with the 
addition of a thesaurus access option to the system (Johnson & 
Maher, 1984). 

Many microcomputer item banking systems have appeared during
the last few years. In their review, Deck and Estes (1984)
identified at least 75 packages. Since microcomputers are
accessible to most people and most instructors need some place
to store items, it is no wonder that so many packages have been
developed. Several reviews of item banking for microcomputers
have also appeared during the last few years (Deck & Estes, 1984;
Hambleton, 1984; Hsu & Nitko, 1984). We will not attempt to
present an extensive review of existing microcomputer-assisted
item banking packages here. Rather, in the remainder of this
section we focus on several microcomputer applications to 
illustrate what we consider to be desirable characteristics of a 
microcomputer item bank that can be used to construct tests. 

ITEMBANK, designed by Bowers (1984) at the American College
Testing Program, is a relatively large and complicated item bank
that may be appropriate for state and local agencies. This bank
was established, maintained, and updated using dBASE II. The
entire test development process was divided into four stages:
the Draft Test Stage, the Working Test Stage, the Final Test
Stage, and the Item Analysis and Updating Stage. There are
sub-menus for each stage. The Working Test Stage and the Final
Test Stage are also relevant to test construction. In the
Working Test Stage, reviewers' comments are incorporated and new 
item data are entered into the data-base. Items are revised, 
assembled, and sent to reviewers for further evaluation. In the 
Final Test Stage, items are further revised. A scoring sheet 
and a test summary sheet are printed. 

This bank was implemented on an IBM PC with color monitor. 
It is a good example of using a computer to assist in the 
construction of tests, not just using the computer to store
items. This system may not be appropriate for individual 
teachers, however. In addition to the requirement for dBASE II 
software, teachers without computer training may find the system 
too complicated. In fact, most teachers may not really need 
such a large system. dBASE II also has been used by DeGruijter
(1985) to develop an item banking system which can incorporate
test analysis results into the item bank.

Bowers also produced a newer version of the system,
entitled dBANK, using dBASE III (Bowers, 1985). He claimed that
dBANK should be considered as a totally new system. A primary 
difference between dBANK and its predecessor ITEMBANK is that 
dBANK emphasizes functions rather than tasks. The four menus in 
dBANK contain: (a) data file maintenance functions, (b) draft 
test development functions, (c) report printing functions, and 
(d) data communication functions. Another change is the 
redesign of programs in order to take advantage of dBASE III's
increased speed. Both screen displays and data base files were
redesigned to take advantage of the new features and
capabilities of dBASE III. In conclusion, the developer
emphasized that "dBANK ensures data integrity and makes the data
management and reporting associated with test development far
more accurate and efficient" (p. 12). (Other systems with similar
capabilities: W. H. Ward, 1984; Hiscox, 1984b.)

MicroCAT (Assessment Systems Corporation, 1985) and the 
Pittsburgh Educational Testing Aids (PETA) System (Nitko & Hsu,
1984a) are comprehensive testing packages which contain rather 
extensive item banking components. A somewhat detailed 
description of these item banking components is presented 
below. MicroCAT (Assessment Systems Corporation, 1985) can 
assist users to develop, administer, score, and analyze 
computerized tests. Designed for the IBM series of 
microcomputers (PC, XT, and AT), the package consists of four 
subsystems: development, examination, assessment, and
management. The development subsystem includes five programs
entitled Graphics Item Banker, Font Generation, Test
Specification, Text Editing, and Test Compilation. The first two
programs can be used to enter, retrieve, and modify test items 
and instructions. A brief description of their features appears 
below. The last three programs can be used to assemble tests. 
Their characteristics are summarized in the Test Design 
section. The examination subsystem can be used to administer a 
test to a single examinee or a group of examinees. Examinees'
responses may be cumulated for later analyses by the assessment 
subsystem. The assessment subsystem can be used to evaluate the 
performance of items/tests using both classical item/test 
analyses and item response theory. Since these features and the 
functions of the management subsystem are not included in our 
definition of a computer-assisted test construction system, they 
are not discussed in this paper. 

With the item banking programs (i.e., the Graphics Item Banker
program and the Font Generation program), this system is capable
of storing up to 14 item characteristics such as display time, 
correct response, and estimates of item parameters. In 
addition, 22 special graphics commands are available for 
creating graphic items. These commands can be grouped into five 
categories: geometric primitives for drawing standard shapes, 
text commands, additional drawing commands, graphics segmenting 
commands, and utilities commands. These commands should be 
sufficient for most common uses. Items entered by this system
also can be organized according to content areas and grouped
into separate directories. This system has incorporated
up-to-date measurement principles. However, it appears that a
sufficient knowledge of measurement is required for efficient
and effective use of this system.

The item banking component included in the Pittsburgh
Educational Testing Aids (PETA) system (Nitko & Hsu, 1984b) is
specifically designed for individual teachers. Although this
component and two other components (i.e., student data-base and
item analysis) form the PETA system, the item banking component
can be used independently to maintain test items and to
construct classroom tests. This system was implemented for the
Apple II Plus and Apple IIe with 48K memory. For item banking,
two disk drives and a printer are required.

There are 11 programs in the item banking component plus
one option for terminating the execution. Since the main menu
reflects the capability of the system, the options available are
listed here (Nitko & Hsu, 1984a):

 1. CREATE/ENTER INTO AN ITEM BANK: A TEST ITEM AND ITEM DATA
 2. CREATE/ENTER INTO AN ITEM BANK: A TEST DIRECTION
 3. RETRIEVE FROM AN ITEM BANK: A TEST ITEM AND ITEM DATA
 4. RETRIEVE FROM AN ITEM BANK: A TEST DIRECTION
 5. REORGANIZE AND COPY: AN ITEM BANK TO SAVE SPACE
 6. TRANSFER ITEM STATISTICAL DATA: FROM A DATA-BASE TO AN ITEM BANK
 7. TERMINATING
 8. RETRIEVE FROM ITEM BANK: ITEMS MEETING YOUR CRITERIA
 9. RETRIEVE FROM A TEST FILE: A TEST ITEM FILE AND ITEM DATA
10. RETRIEVE FROM A TEST FILE: A TEST DIRECTION
11. ESTIMATING THE PROPERTIES OF: THE TEST ON A TEST FILE DISK
12. PRINT ITEMS/DIRECTIONS IN AN: ITEM BANK AND/OR TEST FILE

The item bank component has several special
features. It does not utilize a separate word processor for item
creation and modification. However, several special commands
are included to facilitate word processing. This bank can store
the five most commonly used item types: multiple-choice,
true-false, matching, fill-in, and essay questions. The maximum
length for an item is 1522 characters (including spaces), but
the random access file record length is only 122 characters.
This implies that a longer item may be placed in more than one
record. For shorter items, however, no space need be
wasted.
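
The storage scheme just described, in which a long item spans
several fixed-length records, can be sketched as follows. Only
the two lengths come from the text; the function name and the
simple slicing approach are assumptions, not the PETA
implementation.

```python
RECORD_LENGTH = 122      # random access record length given in the text
MAX_ITEM_LENGTH = 1522   # maximum item length given in the text

def split_into_records(item_text, record_length=RECORD_LENGTH):
    """Split one item into consecutive fixed-length records (hypothetical helper)."""
    if len(item_text) > MAX_ITEM_LENGTH:
        raise ValueError("item exceeds the maximum length")
    # Each slice holds at most record_length characters; only the last
    # slice may be shorter, so a short item occupies a single record.
    return [item_text[i:i + record_length]
            for i in range(0, len(item_text), record_length)]
```

Under these figures an item of 100 characters occupies one
record, and an item of the maximum length occupies 13 records.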

Since items may be retrieved either by criteria or by item
identification numbers (representing contents), the user has
full control of the items to be selected. In retrieving items
according to criteria, the user may modify the criteria in order
to increase or decrease the number of items to be assembled in
the test file disk. The user may examine or modify an item
either when it is in the bank or after it has been selected and
placed into a test file. Item data, either classical item
statistics or estimated item parameters used in item response
theory, are also presented to the user simultaneously with the
item. When a test is assembled, the user may request an
estimate of the quality of the test based on item statistics
obtained previously. When a test is ready to be printed, items
may be ordered according to content, difficulty, or any order
specified by the user.




Since this system was designed specifically for individual
teachers and restricted to minimal hardware requirements, it
cannot include any graphics and symbols in the items. The
response time is not the fastest, but it is fast enough for most 
common usage. The classification scheme for items in the bank 
is especially appropriate for classroom testing. 

Since the application of item response theory in building 
item banks is becoming popular, a framework proposed by Wright 
and Bell (1984) is described here. This framework has been used 
to build item banks used at several school sites. It consists 
of three components. The first component is Bank Plan. Program
FORM in this component is used to decide which items will be
included in a particular form. The second component is Test
Administration, where tests are administered externally and 
responses are obtained. The final component is Bank Building. 
In this component, the program FORCAL calibrates items using the
Rasch model. Then a series of fit analyses are carried out by
the program SHIFT. Formulas needed for the analyses are also
provided. An important contribution of this framework is the 
emphasis on the psychometric aspect of item banking. 
Calibrations of items are incorporated into the process of bank 
building. 

Test Design 

The selection of items for a test is dependent upon the 
user's content specifications and the item selection algorithm 
of the individual computer-assisted test construction system.
The typical system requires users to indicate the number of 
items desired for their test and specify item selection 
restrictions by identifying the type of items needed. 
Generally, items may be specified using any variable by which 
the items are classified in the bank. The item bank is searched 
and all items which meet the user's criteria are noted. Most 
programs then randomly select from among the items that satisfy
the user's restrictions (e.g., Baker, 1973; Libaw, 1973;
Toggenburger, 1973). The level of specificity of restrictions 
is obviously related to the level of specificity of item 
classifications in the item bank. Programs with very crude 
classification systems allow the user to specify actual item 
numbers only (Brown, 1973; Menne, 1973). On the other hand, if 
the item bank has an elaborate classification system, many more
restrictions may be specified. 

Both MEDSIRCH (Hazlett, 1973) and CTSS (Toggenburger, 1973) 
prioritize item selection criteria. An initial search of the 
item bank is performed and the number of items meeting all 
criteria is noted. If the number of items satisfying these 
criteria is less than the requested number of items on the test,
one criterion is dropped and the bank is searched again. This 
process is continued until enough items are identified for 
selection. The MEDSIRCH and CTSS packages differ in that 
MEDSIRCH allows users to prioritize their item selection 
criteria whereas CTSS has established its own prioritization. 
Behavioral level criterion, which CTSS defines as knowledge or 
application of knowledge, is dropped first, followed by item
difficulty.

Sivertson, Hansen, and Schoenenberger (1973)
describe a unique test design system designed to "identify the
continuing education needs of individual physicians" (p. 38).
Their comprehensive item bank contains 2020 five-option 
multiple-choice items covering "all diseases a physician might 
encounter in his practice" (p. 39). Items were first classified 
using the International Classification of Diseases, Adapted 
(ICDA) codings. The authors then added both specialty codes, 
such as General Practitioner (GP), Internal Medicine (IM), 
Pediatrics (P), and General Surgery (GS), and three skill level 
codes as follows: 

Level 1: a common clinical situation and "on the spot"
         decision
Level 2: a decision requiring commonly available
         diagnostic tests and procedures
Level 3: a problem or technique requiring specialized
         training or diagnostic tests to manipulate
         information (p. 39).

To select a subset of items for a test the physician (user) must 
indicate his/her area of specialty. From the items that match
the physician's specialty, a random sample of items is drawn 
from each skill category. 
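
The prioritized search described earlier for MEDSIRCH and CTSS,
in which one criterion at a time is dropped until enough items
are found, can be sketched as follows. This is an illustration
only: the function name, the field names, and the representation
of criteria as an ordered list are assumptions.

```python
def search_with_relaxation(bank, criteria, n_needed):
    """Search the bank, dropping the lowest-priority criterion as needed.

    bank:     list of dicts describing items (hypothetical field names)
    criteria: list of (field, value) pairs, highest priority first
    Returns the matching items and the criteria still in force.
    """
    active = list(criteria)
    while True:
        matches = [item for item in bank
                   if all(item.get(field) == value for field, value in active)]
        if len(matches) >= n_needed or not active:
            return matches, active
        active.pop()  # drop the lowest-priority criterion and search again
```

Placing behavioral level last and item difficulty second to last
in the list reproduces the fixed order in which CTSS drops its
criteria; letting the user order the list corresponds to the
MEDSIRCH approach. A random draw from the returned matches would
then complete the selection.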

The SOCRATES computer-assisted test retrieval system is an
extensive computer network consisting of 11 item banks and over
10,000 items which are available throughout the 19 campuses of
the California State University and Colleges system (Seely &
Willis, 1976). Items can be selected by subject category,
difficulty level, behavior level (classified as either knowledge 
or application), and/or keyword. Maximum test length is set at 
150 items. The system is capable of modifying a test 99 times 
and producing up to 10 scrambled forms of a test. Since the 
system is available to both faculty and students, students may 
request practice tests. The unique feature of this system is 
its networking component. When a test is designed the printed 
copy of the test can be produced at the site of origin if a 
high-speed printer is available. In addition, tests can be 
requested by telephone, assembled at the central processing site 
in Los Angeles and delivered to the campus via a courier 
service. Any campus which has a direct link with the central 
processor can design and print tests at that site. 

In summary, the item selection strategy used by most 
earlier test design systems employing mainframes is to select 
items meeting various user-specified restrictions. When the 
system includes an item bank, the degree of specificity of test 
characteristics is dependent upon the classification scheme of 
the item bank. 

Instead of selecting items based on user-specified item 
characteristics, item response theory may be employed to design 
tests. One example is the IRT Test Design System (Sadock, 
1984). The system can be used on an Apple II Plus or Apple IIe
with a minimum of 48K memory. One disk drive and a printer are 
required. 
This system was designed as a tool for selecting a set of
items for a relatively short, yet efficient, test without
requiring that users possess a complete working knowledge of
test development theory. There are five components to this
system: (a) the test content specification component, (b) the 
test use component, (c) the test construction component, (d) the 
test modification component, and (e) the technical information 
component. Only the test use and test construction components 
are described here. 

As in many test design systems, the test content is
specified by selecting individual or groups of objectives or 
content areas. Many computer-assisted test construction systems 
proceed at this point by selecting a random sample of items 
appropriate to the content domain. The IRT System, however, 
requires that users indicate how the test scores will be used. 
This is accomplished via the test use component. At present, 
the system can design three types of tests, each serving a 
different purpose. The three test types include: (a) tests 
designed to group students (typically referred to as 
classification tests), (b) tests designed to rank students, and
(c) tests designed to assess individual student mastery 
(typically referred to as objective mastery tests). The test 
use component provides users with nontechnical descriptions of 
these different test uses. 

The item selection strategy for all three types of tests is
based on the three-parameter logistic model. (Calibrated item
data are included in the bank that accompanies this system.)
When a grouping test is requested, the number of groups to be
formed as well as the percent cutoff scores must be specified.
Items are then selected if their point of maximum information
falls within an acceptable range of the theta-equivalent of any
cut score.
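
The selection rule for grouping tests can be sketched with the
standard result for the three-parameter logistic model, in which
an item with discrimination a, difficulty b, and lower asymptote
c is most informative at theta = b + ln((1 + sqrt(1 + 8c)) / 2)
/ (Da). The sketch is not the IRT Test Design System code. The
scaling constant D = 1.7, the tolerance, and the field names are
assumptions, and the cut scores are assumed to have been
converted to the theta scale already.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed here)

def theta_max_information(a, b, c):
    """Ability level at which a three-parameter item is most informative."""
    return b + math.log((1 + math.sqrt(1 + 8 * c)) / 2) / (D * a)

def select_for_cut_scores(items, cut_thetas, tolerance=0.25):
    """Keep items whose information peak lies near any cut score.

    items: list of dicts with keys 'a', 'b', 'c' (hypothetical layout)
    tolerance: the "acceptable range" around a cut score (assumed value)
    """
    selected = []
    for item in items:
        peak = theta_max_information(item["a"], item["b"], item["c"])
        if any(abs(peak - cut) <= tolerance for cut in cut_thetas):
            selected.append(item)
    return selected
```

When c is zero the peak falls exactly at the item difficulty b;
a nonzero guessing parameter moves the peak somewhat above b.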

When a test designed to rank students is requested, an 
estimate of the class ability level must be specified. This is 
necessary to ensure that the items selected provide adequate
information for ranking the entire group. By specifying class
ability level, the user is actually specifying acceptable values
of the maximum information levels of the items.

If a test designed to assess individual student mastery is 
requested, items of varying degrees of difficulty across all 
ability levels, but which explicitly represent the content 
characteristics of the domain, are selected. Note that there is 
no mastery cut score associated with this type of test. The 
purpose here is to determine the proportion of the content 
domain which each student has mastered. 

At present, some components of this system are specific to 
the accompanying item bank. The second version of the system, 
however, will contain utility programs for adapting this system 
to any three-parameter IRT-calibrated item bank.

The Rasch model has also been applied to test design
systems. The item bank and score conversion program described
by Haksar (1983) were designed based on the Rasch criteria of an
efficient test. From previous test results, the class score
distribution (including the mean and standard deviation) can be
approximated. According to the Rasch model, the test mean
should equal the average item difficulty, and the distance on
the ability scale between the easiest and the most difficult
item should be four times the standard deviation. Using these
criteria as well as estimates of the score distribution, the
user selects items from the item bank.
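
The two criteria can be expressed as a simple check. The sketch
below is ours, not Haksar's procedure (where selection is done by
hand). It assumes that the class mean and standard deviation are
expressed on the same scale as the item difficulties; the even
spacing of target difficulties and the tolerance are further
assumptions.

```python
def target_difficulties(class_mean, class_sd, n_items):
    """Evenly spaced targets from mean - 2 SD to mean + 2 SD (assumed spacing)."""
    if n_items < 2:
        raise ValueError("need at least two items to define a range")
    low = class_mean - 2 * class_sd
    step = 4 * class_sd / (n_items - 1)
    return [low + i * step for i in range(n_items)]

def meets_rasch_criteria(difficulties, class_mean, class_sd, tolerance=0.1):
    """Check mean difficulty against the class mean and range against 4 SD."""
    mean_difficulty = sum(difficulties) / len(difficulties)
    spread = max(difficulties) - min(difficulties)
    return (abs(mean_difficulty - class_mean) <= tolerance
            and abs(spread - 4 * class_sd) <= tolerance)
```

A user working from an item catalogue could pick, for each
target value, the item in the desired content area whose
difficulty is closest, and then confirm the selection with the
check.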

The item bank contains item difficulty measures in addition 
to content codes. At present the item bank is not in computer 
form. Rather, item selection is accomplished manually by 
inspecting either an item catalogue or an indexed set of item 
cards. The item cards are arranged in order of difficulty 
within content areas. 

In order to translate a raw test score into a scaled score, 
a score conversion program has been written for an Apple. This 
program places the raw score onto the same ability/difficulty
scale that was used in defining item difficulty. 

One important feature of MicroCAT (Assessment Systems
Corporation, 1985) is the test specification procedure. Six
predefined templates are provided. The templates are incomplete
test blueprints which enable the user to specify the items and
requirements of the test. The templates permit the users to
assemble a fixed-length conventional test, a variable-length
conventional test, or a variable-length adaptive test using
either Bayesian, maximum likelihood, or stratified adaptive 
decision strategy. Normally these templates require the user to 
identify items to be included and criteria required for the 
selected testing procedure. If none of the predefined templates 
is appropriate, users may create their own templates by using 
the Minnesota Computerized Adaptive Testing Language. These new
templates, however, must be compiled before they are used to 
design tests. 

Test Administration 

Adaptive testing. Many adaptive testing systems have been
developed during the last decade (Clark, 1976; Weiss, 1978, 
1980, 1983). Most of the systems were designed for research 
purposes rather than for actual implementation. Also, most of 
the systems concern aptitude measurement. Only recently has 
attention shifted to achievement testing (Bejar, Weiss, &
Kingsbury, 1977; Brown & Weiss, 1977; Weiss & Kingsbury, 1984). 
Two examples of adaptive aptitude testing are discussed below. 
Adaptive achievement testing is discussed in more depth in the 
next section on Evaluation and Research. 

Unlike many adaptive testing systems, TAILOR (Cudeck, 
Cliff, & Kehoe, 1977; McCormick & Cliff, 1977) does not require 
extensive pretesting of items. Rather, using the tailored 
testing approach by Cliff (1975), TAILOR estimates both item and 
person characteristics simultaneously. There are two versions 
of TAILOR. TAILOR-APL (McCormick & Cliff, 1977) is used for 
individual administration and the FORTRAN version (Cudeck, 
Cliff, & Kehoe, 1977) is designed for group administration with 
a minimum of 15 examinees. As more students are tested, more 
accurate difficulty estimates are made, thereby resulting in a 
more individually tailored test administration. McCormick and 
Cliff (1977) claim that after six administrations, there is a
significant reduction in the number of items administered to 
subsequent examinees. 

The FORTRAN version of TAILOR begins by administering the 
same item to all examinees. An implied ordering of items is 
performed based on the observed numbers of correct and incorrect 
responses. Examinees' responses are also used to credit
examinees with correct responses to easier items. The process
continues by matching item difficulty estimates with the
examinee's performance on previous items.

The Broad Range Tailored Test of Verbal Ability (BRTTVA)
(Lord, 1977) is an excellent example of an adaptive test 
administration system. While implementing adaptive testing 
strategies, the BRTTVA can be used to assess verbal ability from 
the fourth grade level to the graduate level. In addition, 
parallel test forms may be generated. 

Nonadaptive testing. The first example is a system used by
the School of Basic Medical Sciences at the University of 
Illinois, Urbana-Champaign. Students are directly involved in 
the administration of diagnostic assessment examinations. 
Students must take nine comprehensive exams each containing 
approximately 180 items covering an individual clinical 
problem. Four to five hours are needed to complete the exam. 
Since students work through the curriculum at their own pace,
not all students will necessarily take the examination at
the same time.



The test administration system, named LEVELS, is written in
TUTOR for PLATO IV implemented on a CDC 7600 (Sorlie, Essex, & 
Shatzer, 1979). Exams are administered via a PLATO IV 
terminal. First, students must schedule their exam using the 
"Level III Scheduler" program. Students specify a test date and 
time as well as total testing time required. At the scheduled 
time of the test, the student logs on the system and is 
presented with examination instructions. Once the exam is 
specified, a list of disciplines covered in the exam is 
presented to the users, which includes the number of items 
within each discipline. The student then specifies the sequence 
in which he/she would like the disciplines to be presented. 
During the exam, the student still has some control over the 
order of administration. Students have the option of omitting 
items and receiving a zero score, or skipping an item and
returning to it after attempting all items in the current 
discipline. At the end of each discipline, students' scores are 
presented to them. Before proceeding with the next discipline, 
questions answered incorrectly may be reviewed and a second 
answer may be selected. This process allows students to raise 
their scores. The scoring component of this system keeps track 
of all student responses including a record of items answered 
correctly on a second attempt.

Most test administration applications for a microcomputer 
can be classified as either page-turners or drill and practice 
exercises. In this format, test items are presented on the 
monitor and examinees respond to items one at a time. Other 
than that, features of traditional paper and pencil testing
remain. A more appropriate use of microcomputers for
administering tests is currently being field tested at several
colleges across the country. Educational Testing Service (ETS) 
and the College Board have developed a system for administering 
both conventional and adaptive tests (Ballas, 1984). The 
current emphasis is to provide institutions with a tool for 
administering and scoring placement tests in a relatively short 
period of time. 

There are a few recently developed test administration 
systems which use a somewhat traditional format in the 
administration of the test, while applying many of the advances 
in computer technology to the scoring and reporting aspects of 
the test. Although the development of these systems was not 
centered around new applications of measurement theory or new 
measurement theories, they do seem to illustrate the newest 
trend in computerizing test administration. 

KEYWAY, a test scoring and reporting system developed by 
ETS, is characterized by many of the advantages of 
computer-assisted test administration systems cited previously.
Rather than employing computers in the test administration 
phase, however, KEYWAY uses microcomputers in recording answers, 
scoring, and reporting results. The Center for Occupational and 
Professional Assessment (COPA) uses KEYWAY for several licensing 
examinations, including the Real Estate Licensing Examination 
(RELE). When preregistered candidates report to a KEYWAY 
testing center, they receive a standard, printed test booklet
and a KEYWAY Answer Pad. The Answer Pad is not used to display
the test content. Rather, its primary purpose is to record
information. The one-line LCD panel displays each item number
and waits for the candidate to respond with either an answer
choice or a request to advance to the next item. Once all items
have been attempted, the candidates may review all skipped
items. At the end of the test, all demographic information and
item responses are transferred to the Memory Module, which is a
portable, transferable unit that resides in the Answer Pad
during the exam. Upon completing the exam, the candidate 
returns the Answer Pad to the test administrator. Scoring is 
accomplished by removing the Memory Module from the Answer Pad 
and inserting it into the Memory Reader which reads the 
responses and downloads the information to an IBM PC. The test 
is then scored and a printout of results is produced. 

During the fall of 1985, COPA pilot tested a second
computer-assisted test recording, scoring, and reporting system 
which will be available for candidates of the National 
Association of Purchasing Management (NAPM) Examination. 
Although quite similar to KEYWAY, this system has some unique 
features. The hardware requirements for this system include the
Radio Shack TRS 80 Model 2 or Model 12, a printer, and two 
integral or separate disk drives. Although candidates receive 
standard printed test booklets, directions and items also appear 
on the monitor. Rather than using a Memory Module type device, 
standard floppy disks are used for permanent storage of test 
information. When each module of the exam is completed, the
test is scored and two copies of the score report are produced,
one for the candidate and one for ETS use.

For the past two years, the American College has been 
offering computer-based examinations through their Examinations 
on Demand (EOD) program. Prior to 1982, students enrolled in
the Chartered Life Underwriter (CLU) or the Chartered Financial
Consultant (ChFC) program were required to pass 10 nationally
administered paper-and-pencil examinations. These exams were
offered twice a year. Since the EOD program began, candidates
have had the option of taking the tests in standard written format
on the predetermined test dates or requesting the 
computer-administered version of the exam at a time which fits 
their own schedules. 

The EOD exams are administered through the Control Data 
Corporation (CDC) Education Center network, which houses PLATO 
terminals. Through the CDC network, candidates have 
substantially more flexibility in scheduling both the time and
location of the exam, since Education Centers are currently 
located in 35 states across the country. 

Similar to many computer-administered testing systems, the
CLU and ChFC exams are scored immediately upon completion and 
the candidates leave the test site with their scores in hand. 
Nungester and Vaas (1984a) report that in the first two years of 
the program, over 16,000 candidates have participated in the EOD 
program. Characteristics of participants in the EOD program are 
continually being examined (Nungester & Vaas, 1984b). It is 
hoped that by examining the characteristics of students who opt
for the EOD system, insight may be gained as to the acceptance,
advantages, and disadvantages of the current system. 

It should be noted that the computerized version of The 
American College tests employs item selection strategies based 
on both item difficulty and test content.

IV. EVALUATION AND RESEARCH



The purpose of this section is to discuss issues or studies 
dealing with the evaluation of test construction systems as well 
as using the proposed testing systems to study test construction 
problems. If the studies investigate testing issues that may 
have implications for computer-assisted test construction, 
though they may not involve any computer-assisted test 
construction systems, they are included in this section. 

To use computers to assist in test construction, we must 
make sure the quality of the test construction process will not 
be compromised. Therefore, evaluation of any test construction
application is indispensable. Unfortunately, thorough evaluation
and research are not usually done before a system is on the
market for distribution (Deck & Estes, 1984). This could be
attributed to several factors. First, to sell a product in a
competitive market, timing is crucial. Most developers cannot
wait for a lengthy evaluation to be carried out. Second, the
technology is changing rapidly. If a system is not on the market
right away, new technology may make the system obsolete. Third,
users share the blame. They are so fascinated by the technology
that they tend to ignore how the system contributes to
educational testing.

A computer-assisted test construction system must be
evaluated from three viewpoints. These three perspectives, in
order of priority, are (a) the measurement specialists' view,
(b) the users' (teachers') view, and (c) the computer
specialists' view. Measurement specialists must make sure the
quality of the test construction process is maintained or
improved. If a system is not theoretically sound from a
measurement perspective, it should not be introduced to the
users. Users are responsible for determining whether the system
is appropriate for their intended clientele in terms of ease of
use and meeting their needs. Computer specialists'
responsibilities are to make sure the system is running smoothly
and efficiently. But efficiency should not override measurement
quality and usability.

As with any product, two different evaluations should be
conducted: formative and summative. During the formative
evaluation, data should be collected for the purpose of
improving the system to make sure it functions as intended. The
summative evaluation should consider whether the implementation
of the system improves the quality of the test construction
process by the users. Both types of evaluation data should be
available to the clients before a product is distributed on a 
large scale. 

So far, with the exception of adaptive testing, evaluation
data for computer-assisted test construction systems are
rather scarce. Although published reviews are available from
some journals (e.g., Educational Technology, Social Science
Microcomputer Review), they cannot be used as a substitute for a
formal evaluation. Normally, these reviews are only intended to
serve as a buyer's guide. To examine the quality of a system,
users must demand evaluation data obtained through a formal
process. Appropriate evaluation instruments, such as the one
cited in Hsu and Nitko (1983), or the user's evaluation form
proposed by Ju (1984), should be used to collect evaluation
data. Section I of the user's evaluation form proposed by Ju is
given in Appendix A. The 30 statements were designed to measure
the following aspects of the computer package: usefulness,
efficiency, documentation, error handling, and performance.
Section II is a series of open-ended questions. This section
allows users to add any additional comments which are not
addressed in Section I.

Whether the capabilities of computers have been used 
efficiently and effectively can be evaluated by computer 
specialists. Consumers of testing systems must consider both 
hardware and software capabilities in terms of intended uses.
Readers interested in criteria for hardware and
software selections may wish to consult guidelines published in
various journals (e.g., Hiscox, 1983, 1984a).

Item Construction 

Since this area is still in a rather primitive stage,
literature concerning the evaluation and research of
computerized item writing procedures is relatively scarce. In
this section, we will first briefly illustrate relevant
evaluation issues by using the works of Millman (1982). Then
the focus will turn to research on item writing techniques which 
may not yet be computerized. 



The computer-based test construction system constructed by
Millman and Outlaw (1978) was implemented in an introductory
statistics course using a mastery learning strategy. The
evaluation focused on both the system and student attitudes and
learning. Millman (1980) reported that the computer programs
had produced all features anticipated and ran smoothly
without any detectable errors. Major drawbacks of the system
concern the specific configuration of computer hardware and poor
documentation. These drawbacks limited the transportability of
the system. 

Using such a system produced some positive impacts on both
instructional processes and student attitude. The instructor
had to prepare more thoroughly in terms of what should be taught
and assessed. The students showed positive attitudes toward the
mastery test approach. Final examination scores of students
involved in this approach, however, failed to demonstrate
superiority in comparison with the scores of students involved
in the traditional approach. The researcher has attributed this
finding to the limitation of the criterion measure employed.
Unfortunately, the quality of items generated by the system was
not reported. The evaluation of the system would have been
stronger if other instructors had been involved.

Research issues concerning item writing techniques can be
illustrated by the study conducted by Roid and Finn (1978). To
assess the feasibility of employing the linguistic
transformation approach in item construction, computer-based
algorithms were developed and used to analyze prose subject
matter. High information words were identified from prose.
Sentences containing these words were transformed into
multiple-choice items by item writers who generated alternatives
using an informal approach and by an algorithmic approach.
Items from these two approaches were compared and evaluated
using data obtained from the try-out of the items. Results
showed that both types of items were equally effective in
measuring learning. Items derived from key word nouns tended to
be of low quality. The authors concluded that the
algorithmic approach is feasible in generating foils for 
multiple-choice items. 

Before item writing techniques can be actually 
computerized, more studies like Roid and Finn (1978) are needed
to determine which aspects of each item writing technique can 
best be done by a computer and which aspects the computer cannot 
perform as well. We should implement only the ones that can 
produce quality items. 

Item Banking 

Researchers interested in item bank evaluations may wish to 
check the following sources for ideas: Hiscox and Brzezinski 
(1980), Hiscox (1983), Deck and Estes (1984), Estes and Arter
(1984), Baker (1972), Millman and Arter (1984), and Hsu and
Nitko (1984). These studies do not deal directly with item bank
evaluation. However, in their discussions of the requirements
for a good item bank or in their reviews of item banks, they
mention criteria that may be useful in item bank evaluation. If
we know the requirements for a good item bank, we should be able
to identify the criteria that can be used for evaluation. Since
some of the criteria are discussed in an earlier section, they
are not repeated here. 

How feasible is the use of an item bank for test 
development? This issue was investigated by Brzezinski and 
Demaline (1982). After comparing test development under the 
traditional approach with test development using an item bank in 
terms of both costs and outcomes of test development, they 
concluded that more advantages can be gained by using an item 
bank. We have to keep in mind that these two different 
approaches are not directly comparable. 

Proper use of item banks depends on their intended use and 
the quality of items stored in the banks. Without good quality 
items, item bank applications are not likely to produce good 
outcomes. In addition to test assembly for individual 
instructors, item banks have been used to monitor pupil 
performance (e.g., Johnson and Maher, 1982), and to establish
and maintain cutoff scores for teacher certification 
examinations (Legg, 1982). One application that is focused on 
here is the use of item banks in adaptive testing. The most 
common application is to use estimates obtained from item
response theory as indicators of the quality of items used in 
the bank. 

As shown by Jensema (1977), Bayesian decisions were 
affected by the characteristics of the item bank. What
characteristics of the item bank should be of primary concern?
Most research has focused on the following main issues:

(1) What is the minimum number of items required for the
bank? 

(2) Should the one- or three-parameter model be used? 

(3) What is the minimum sample size required to calibrate 
the items? 

(4) What is the minimum number of items required for a
test? 

Since these four issues are related to each other, our
discussions of them cannot be completely separated from one
another.

The size of item banks varies greatly from one to another.
Examples cited by Wright and Bell (1984) range from 51 items to
9452 items. Naturally one may wonder what minimum number of
items is required for an item bank. The issue does not arise
if good items are available. For reasons of economy,
however, users of item banks may not be able to store as many
items as desired. Also, the retrieval process will be slowed
down substantially and the classification procedure will be very
complicated when a large item bank is involved. On the other
hand, too few items are more likely to create serious problems
for testing than too many items. Therefore, the general rule of
"the more the better" (Millman and Arter, 1984) seems
reasonable.

Most studies concerning this issue are in the
context of adaptive testing. Ree (1981) conducted a simulation
study to investigate the effects of item calibration, sample
size, and item pool size on adaptive testing. Using the
three-parameter model, calibrated item pools of 100, 200, and
300 items with calibration sizes of 500, 1,000, and 2,000 were
examined. Based on the reduction of absolute error of ability
estimates, he concluded that a minimum of 200 items with
calibration size of 2,000 subjects is required. Sizes between 
200 items and 100 items may be adequate if the items have high 
discrimination power and a wide range of difficulty (Reckase, 
1981). Urry (1977) also emphasized the quality of items in the 
bank. In addition to the requirement for a minimum of 100 
items, the item discrimination parameter must exceed .8 and the 
item difficulty parameter must be spread evenly and widely. 
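
The quality requirements attributed to Urry (1977) above can be
turned into a simple screening routine. The following is only a
sketch under stated assumptions: the names Item and screen_bank
are hypothetical, and the report does not say how "evenly and
widely" should be measured, so the difficulty range of -2 to +2
and the maximum gap of 0.5 between adjacent difficulties are
illustrative choices, not values from the cited studies.

```python
from dataclasses import dataclass


@dataclass
class Item:
    a: float  # item discrimination parameter
    b: float  # item difficulty parameter


def screen_bank(items, min_items=100, min_a=0.8,
                b_range=(-2.0, 2.0), max_gap=0.5):
    """Check a calibrated bank against Urry-style requirements.

    min_items and min_a follow the text; b_range and max_gap are
    assumed operational definitions of "widely" and "evenly".
    """
    bs = sorted(item.b for item in items)
    gaps = [hi - lo for lo, hi in zip(bs, bs[1:])]
    return {
        "enough_items": len(items) >= min_items,
        "discriminating": all(item.a > min_a for item in items),
        "wide": bool(bs) and bs[0] <= b_range[0] and bs[-1] >= b_range[1],
        "even": bool(gaps) and max(gaps) <= max_gap,
    }


# A hypothetical bank of 101 items with difficulties from -2 to +2.
bank = [Item(a=1.0, b=-2.0 + 4.0 * i / 100) for i in range(101)]
report = screen_bank(bank)
```

A bank that fails any single check would, under this reading, be
a candidate for additional item writing before adaptive use.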

Weiss and Kingsbury (1984) also concurred that a minimum of 
100 items is acceptable. Green et al. (1982) indicated that the 
United States Armed Services are planning to develop a pool of
200 items for each of the ten proposed computerized adaptive 
Armed Services Vocational Aptitude Battery tests. Since 
unidimensionality is assumed for IRT, this item pool size should 
be considered as the minimum requirement for measuring one 
particular trait. The minimum number of items required for an 
item bank can be determined if the number of traits or contents
to be measured is decided. 

Item statistics are useful indicators of item quality.
We should not ignore them simply because they may be
misused by users who believe only in statistical criteria in
judging items. Because of this position, our concern is the
issue of what kinds of item data should be collected rather than
the issue of whether items should be calibrated as discussed by
Millman and Arter (1984).

Three kinds of item data are most commonly used: classical
item statistics, estimates based on the Rasch model (1PL), and
estimates based on the three-parameter logistic model (3PL).
One of the major factors to be considered in deciding which kind
of statistics to use is probably the sample size available
for item calibration. For small classroom testing, calibration
of items based on item response theory is not possible. Some
classical item statistics may be appropriate for small classroom
testing (Nitko & Hsu, 1983). These statistics may be computed
and stored along with the items in item banks. For calibration
using the 3PL model, 1,000 subjects per item are required (Green
et al., 1982, 1984; Weiss & Kingsbury, 1984). However,
Hambleton and Cook (1983) have shown that for a 20 item test,
the gain in the precision of the ability estimate from
increasing the calibration size from 200 to 1,000 is relatively
small. When the sample size is less than 200, Lord (1983)
showed that the Rasch model performs slightly better than the
two-parameter model.

It seems reasonable to say that the superiority of the 3PL 
model cannot be exhibited unless a large calibration sample is 
available. Also, if the item pool is small, say 40, the 3PL 
model does not show any advantage in tailored testing either 
(McKinley & Reckase, 1984). These results and other 
complications in using the 3PL model, such as the possibility of
non-convergence, may lead one to conclude that the 1PL model
should be used instead of the 3PL model. Such a conclusion
obviously is premature. These results only imply that the 3PL
model is not any better than the 1PL model when they are
compared under less than desirable conditions. Also, these
results do not prove that the 1PL model is accurate. Instead,
these results only imply that if the 3PL model is employed under
desirable conditions, the results obtained have a better chance 
of being accurate. 
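
For readers less familiar with the two models compared above, a
minimal sketch of the item response function may help. It uses
the standard logistic form, in which the 1PL model is the
special case of the 3PL model with a common discrimination and
no guessing parameter. The function name p_correct is
hypothetical, and the scaling constant of 1.7 that some authors
include in the exponent is omitted here as a simplifying
assumption.

```python
import math


def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability; b: item difficulty;
    a: item discrimination; c: lower asymptote (guessing).
    With a = 1 and c = 0 this reduces to the 1PL (Rasch) form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# 1PL: an examinee whose ability equals the item difficulty.
p_1pl = p_correct(theta=0.0, b=0.0)

# 3PL: the same match, but with guessing on a five-option item.
p_3pl = p_correct(theta=0.0, b=0.0, a=1.2, c=0.2)
```

The two extra parameters per item are what the 3PL calibration
must estimate, which is one way to see why it needs the larger
calibration samples discussed above.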

Test Design 

Test design using traditional item statistics does not 
really require computer assistance. Most users subjectively 
decide whether or not an item should be included after examining 
item information. At most, they may estimate the reliability 
coefficient of the newly designed test using item statistics 
obtained previously. This process has been computerized by some 
systems (e.g., Nitko & Hsu, 1984a). Nevertheless, we cannot 
find any research studies dealing specifically with this issue. 

Computer assistance is most likely to be required if item 
response theory is utilized in test design. Since such an 
application is relatively new, there are rather few studies 
available in this area. The only example which we may discuss 
is the system developed by Sadock (1984), which was illustrated 
previously. In that system, the developer utilized four 
measurement specialists and nine users (teachers) to try out the
system. In addition to comments on specific sections of the
system, time required for test design was also recorded and
analyzed. At the end, the users also completed a rating form to
reflect their impressions of the system. 

In addition, to fully assess the capacity and efficiency of 
the system, as well as the quality of the tests produced, the 
researcher designed 144 experimental tests. For these trial 
runs, the following test characteristics were varied: (a) type 
of test or intended use of test results (i.e., tests were
designed either to group students, rank students, or assess 
individual student mastery); (b) degree of specificity of the 
content domain (which affects the size of the item pool); (c) 
desired test length; (d) the number of and value of the
cutoff score(s) when classification tests were developed; and
(e) the range of class ability level when a test designed to 
rank students was specified. For each experimental test, length 
of time required to design the test was recorded. Test 
information curves were plotted and compared to the theoretical 
test information curve for each type of test, to determine 
whether maximum information was indeed obtained at the critical 
decision-making point(s) on the ability scale. Although the
evaluation study was quite extensive in coverage, the system was 
not evaluated by computer specialists. 



Test Administration 



Research concerning computerized test administration may be 
classified into two major categories: (a) research designed to
compare computerized testing with conventional testing
procedures, and (b) research designed to study adaptive testing 
strategies which may be computerized. 

Roid (1984b) listed 11 studies published between 1969
and 1984 comparing computerized testing with conventional
testing. Tests involved were standardized intelligence tests and
personality inventories. Most of the studies found no
significant difference between the two testing modes. Some of
the major findings include: (a) a high state of anxiety under
computerized testing; (b) more honesty, openness, and 
willingness to respond under computer administration; and (c) 
the detection of unexpected responses under computerized 
testing. In general, subjects' familiarity with the computer 
seems to affect their performance under computerized testing. 

One major advantage of computerized testing is the ability
to adapt the test to subjects' ability level. Since studies
listed in Roid (1984b) are nonadaptive, nonsignificant findings
in most studies should not be a surprise. Merely simulating
paper and pencil tests on computers is not a good way to utilize
computer technology. Unless computerized testing can do a
better job than regular paper and pencil testing, it is not
justifiable to use expensive computer testing to replace 
relatively inexpensive paper and pencil testing which can be 
administered in a large group simultaneously. 

Since so many studies on adaptive testing strategies have 
been generated during the last decade, it is not possible to 
cover them in a few pages. Instead, only a few representative
studies are discussed. The focus here is on strategies of item
selection when an adaptive test administration approach is 
employed. 

As mentioned previously, computerized adaptive testing was 
most successfully applied to aptitude testing. In this section,
we discuss two studies dealing with the evaluation of such an 
application. The first is an empirical investigation of the 
Broad Range Tailored Test of Verbal Ability (BRTTVA) developed 
by Lord (1977), which was described in a previous section. This 
investigation was conducted by Kreitzberg and Jones (1980). To 
carry out the study, the researchers developed a computer system 
that can administer two forms of the BRTTVA. Each form of the 
test consisted of 25 items and the administration of the two 
forms was counterbalanced. The BRTTVA was administered to 146 
high school students. A questionnaire to measure examinees' 
attitudes toward the testing was administered at the end.

Data analyzed and presented included descriptive 
characteristics of the observed data, information functions of 
both forms, reliability and validity, and the performance of the 
maximum-likelihood estimator. Since this estimator is the key
to the selection of items, a brief description of its 
performance is warranted. The item selection procedure was 
investigated by a Monte Carlo analysis. To compare the actual 
item selection procedure and the ideal situation, scattergrams 
were plotted to show the relationship between the number of
correct responses and the final estimate of ability. Ideally,
no regression would be expected. The results show some
regression. In addition, these graphs were also compared with
graphs obtained by simulating the responses of examinees using
the estimated ability. Some discrepancies were noted. This
implies that the item selection process is in need of further
improvement.

The second example is an evaluation plan developed for the
Navy by Green et al. (1982). Suggested areas of evaluation
include item content, reliability, validity, item parameters,
item pool characteristics, item selection and test scoring,
stopping rules, and so on. In terms of item selection, three
methods were suggested: the Bayes updating method proposed by
Owen (1969, 1975), the maximum information method proposed by
Lord (1977), and a finite Bayes method proposed by Bock and
Aitkin (1981).
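
Of the three methods, the maximum information rule is the
easiest to illustrate. The sketch below assumes 3PL item
parameters and a current ability estimate and simply picks the
unused item with the greatest information at that estimate. It
is an illustration of the general idea, not a reproduction of
Lord's (1977) implementation, and the names select_next and
pool are hypothetical.

```python
import math


def item_info(theta, a, b, c):
    """3PL item information (1.7 scaling constant omitted)."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next(theta_hat, pool, administered):
    """Index of the unused item most informative at theta_hat.

    pool: list of (a, b, c) tuples; administered: set of indices
    of items already given to this examinee.
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    if not candidates:
        raise ValueError("item pool exhausted")
    return max(candidates, key=lambda i: item_info(theta_hat, *pool[i]))


# Hypothetical three-item pool with equal discrimination.
pool = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
first = select_next(0.9, pool, administered=set())
second = select_next(0.9, pool, administered={first})
```

With equal discriminations and no guessing, the rule reduces to
choosing the item whose difficulty is closest to the current
ability estimate.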

Although this report (Green et al., 1982) includes only an
evaluation plan, its recommendations for evaluations should be
considered seriously by anyone who is planning to develop an
adaptive testing system. The recommendations cited below
address the efficiency of item selection in adaptive testing:

1. "The procedure for item selection and ability estimation
   must be documented explicitly and in detail." (p. 52)

2. "The procedure should include a method of varying the
   items selected, to avoid using a few items
   exclusively." (p. 52)

3. "The procedure used should include a mechanism to
   maintain a rough balance of correct answer options."
   (p. 62)

4. "The computer algorithm must be capable of administering
   designated items, and recording the response
   separately, without interfering with the adaptive
   process." (p. 53)

5. "The computer system must be able to base the choice of
   a first item on prior information." (p. 53)

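
One common way to meet the second recommendation is to choose
at random among the few most informative unused items rather
than always taking the single best one. The sketch below shows
that idea only. The plan itself does not prescribe this
particular method, a two-parameter information function is used
for brevity, and the names pick_varied and k are hypothetical.

```python
import math
import random


def info_2pl(theta, a, b):
    """Item information under a two-parameter logistic model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def pick_varied(theta_hat, pool, administered, k=3, rng=random):
    """Pick at random among the k most informative unused items.

    pool: list of (a, b) tuples; administered: set of used indices;
    rng: any object with a choice() method, e.g. random.Random(seed).
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    if not candidates:
        raise ValueError("item pool exhausted")
    candidates.sort(key=lambda i: info_2pl(theta_hat, *pool[i]),
                    reverse=True)
    return rng.choice(candidates[:k])


# Hypothetical pool; theta_hat = 0.0 stands in for a starting value
# derived from prior information (the fifth recommendation).
pool = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
rng = random.Random(0)
picks = {pick_varied(0.0, pool, set(), k=3, rng=rng) for _ in range(200)}
```

Setting k to 1 recovers the plain maximum information rule, so
the degree of variation can be tuned against the loss of
measurement efficiency.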
The nature of research on adaptive testing strategies during 
the 1970s is probably best represented by the final report 
prepared by Weiss (1976). Since 1973, a group of researchers at 
the University of Minnesota, under the leadership of David 
Weiss, has dedicated itself to the study of computerized testing 
and produced many technical reports. This 1976 final report is 
a summary of their efforts during the first three years. The 
objectives of their investigation were: (a) to develop and 
implement the stratified adaptive ability testing using 
computers, (b) to compare various strategies for adaptive 
testing, (c) to study the effect of item selection and feedback 
on ability test scores, and (d) to assess the usefulness of test 
information for diagnostic purposes. 

Included among the major findings presented by Weiss
(1976) are the following:

(a) The rankings of adaptive strategies, in terms of logical
analysis, are Bayesian, maximum likelihood, stradaptive,
pyramidal models, and flexilevel.

(b) Based on information curves, stradaptive and Bayesian
are most desirable and flexilevel is least desirable.

(c) The Bayesian approach has certain weaknesses that will
limit its utility.

(d) In addition to its logical appeal, simulation results
show that "the stradaptive test appears to provide the best
realization of the ideal of measurement with equal and high
precision of all trait levels" (p. 3).

Since our interest in adaptive test administration is 
limited to item selection strategies, we will discuss only two
additional research issues below. Readers interested in other
issues of adaptive testing may wish to consult Clark (1976) and
Weiss (1974, 1978, 1980, 1983). 

The first issue to be addressed is related to the Bayesian 
decision strategy. This strategy has attracted a great deal of 
interest since the publication of Owen's studies (1969, 1975). 
Weiss (1974) and his associates (Vale & Weiss, 1975; McBride & 
Weiss, 1976) have found some strengths and some weaknesses. In 
terms of desirable characteristics of adaptive testing and
information curves, the Bayesian strategy is ranked as one of
the highest among the various strategies compared. However, the
obtained ability estimates were found to be highly correlated
with test length. In addition, the estimates are not equally
precise across ability levels, and the obtained scores seem
to be related to the prior ability estimate used.
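
The dependence of Bayesian scores on the prior can be seen in a
small numerical sketch. Note that this is not Owen's procedure,
which uses a closed-form normal approximation. Instead, the
posterior is approximated on a grid of ability values, which is
an assumption made here for transparency. The function name
posterior_mean and the item parameters are hypothetical.

```python
import math


def posterior_mean(responses, prior_mean=0.0, prior_sd=1.0, n=121):
    """Posterior mean of ability on a grid from -6 to +6.

    responses: list of ((a, b, c), u) pairs, where (a, b, c) are 3PL
    item parameters and u is 1 for a correct answer, 0 otherwise.
    """
    grid = [-6.0 + 12.0 * k / (n - 1) for k in range(n)]
    # Start from an unnormalized normal prior.
    weights = [math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)
               for t in grid]
    # Multiply in the likelihood of each observed response.
    for (a, b, c), u in responses:
        for i, t in enumerate(grid):
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (t - b)))
            weights[i] *= p if u else (1.0 - p)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


# The same short response record scored under two different priors.
record = [((1.0, 0.0, 0.0), 1), ((1.0, 0.5, 0.0), 0)]
low_prior = posterior_mean(record, prior_mean=-1.0)
high_prior = posterior_mean(record, prior_mean=1.0)
```

With only a few items the two estimates differ noticeably, and
the gap narrows as more responses are added, which is consistent
with the reported relation between the estimates and test length.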

This strategy has been implemented by Urry (1975). The 
results seem promising, but the strategy has never been 
evaluated under real life testing situations on a large scale. 
With the development of computer technology, the computational
aspects of the strategy should be feasible even using
microcomputers. Further research on this strategy should be
encouraged. 

The second issue is related to adaptive achievement 
testing. A great deal of theoretical research has been done on 
adaptive aptitude testing. Relatively little research has been 
done on achievement testing, possibly due to the difficulty in 
dealing with multi-trait assessment in item response theory. 
Recent developments in multivariate methods should be able to 
provide a theoretical foundation for multi-content achievement 
testing (Roid, 1984b; Embretson, 1985). Without a theoretical
base, earlier attempts to investigate adaptive achievement 
testing usually made decisions about each objective 
independently. Branching between objectives was established in 
advance either through logical analysis or hierarchical 
analysis. Weiss and his associates (Brown & Weiss, 1977; 
Gialluca & Weiss, 1979) proposed an inter-subject branching
strategy for achievement testing. More theoretical studies in
this area are badly needed. Since achievement testing is so
important in the educational enterprise, adaptive achievement 
testing should have a great deal of potential. Item selection 
strategies for this type of achievement testing should prove to 
be a challenging issue. 

To evaluate computer administered tests, a set of criteria 
proposed by Millman (1984) should be considered. These criteria 
are: (a) cost efficiency, (b) comparability to measure what is
desired, (c) feasibility, (d) contribution to instruction,
(e) precision of measurement: testing time, (f) security, (g)
concern for the individual, (h) ease of scoring, and (i)
fairness. This set of criteria should be applicable to both
adaptive and nonadaptive testing. In addition, guidelines
being prepared by the American Psychological Association (1985) 
should be considered. 
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V. 



PROSPECTS FOR THE FUTURE 



During the last 20 years, many attempts have been made to explore 
the possibility of using computer technology to assist in the 
construction of tests. Actual successful applications on a 
large scale are relatively few. For those applications that are 
actually in operation, most are simply replacing paper and 
pencil tests or human labor by the computer. With the exception 
of adaptive testing, it is very difficult to find any 
documentation which shows that the quality of assessment 
processes is improved as a result of using the computer. Is the 
quality of items improved because we can generate and construct 
items by using the computer? Is the quality of tests improved 
because the computer can be used to bank items and/or to design 
tests? Is the quality of testing improved because of the 
feasibility of computerized adaptive testing? We have some 
positive evidence for the last question. But computerized 
adaptive testing procedures have not been implemented on a large 
scale and are limited to aptitude testing. This limited success 
could be attributed to various reasons (Roid, 1984b). One of 
the reasons is probably related to computer technology. Before 
microcomputers, the accessibility of computers was a problem for 
most users. The continuing development of new hardware and the 
lack of portability of software across different hardware have 
also limited implementation. This difficulty may be reduced with 
the availability of microcomputers. Another reason, which we 
believe is crucial, is the widespread neglect of measurement 
quality in applying computer technology. Although 
there are many testing packages available on the market, good 
products are difficult to find. We are so fascinated by the 
technology that we want to do everything using the computer. 
But we must ask whether or not the computer can improve the 
quality of the activity. It is all right to be concerned about 
technical quality, such as attractive color graphics and short 
response time, but we must also pay attention to quality 
from a measurement perspective. If we can overcome these 
difficulties, the future of computer-assisted test construction 
should be very bright. 

In the remaining sections, an attempt is made to outline 
some prospects for the future of computer-assisted test 
construction. 

Item Construction 

Using the computer to construct items is very useful, but 
not easy. Although success has been rather limited so far, 
the potential is relatively great. To be successful, we may 
work from several directions: 

(a) The first priority is to develop more item construction 
theories that can take advantage of artificial 
intelligence and the phrase recognizability of the 
computer. More specifically, Millman (1980) suggested 
that we have to develop a high level computer language 
specifically for item writing purposes and improve our 
domain specification strategies to make them feasible 
for computer item generation. (See also Roid, 1984b.) 

(b) Effort should be made to develop different item types 
which can take advantage of the computer's capabilities 
(Hambleton, 1984; Johnson, 1983; Wood, 1984). 
Traditional item types are designed for paper and 
pencil tests. Routinely using the computer to 
construct and administer items of the traditional types 
is not a desirable way to use the computer. We 
must take advantage of the computer's capabilities to 
improve our testing by developing new item types. For 
example, we may be able to use graphics to simulate 
test item conditions for problem solving. Some 
prototype items have been developed already by Hunt and 
his associates (Hunt, 1985) to measure spatial ability. 

(c) Studies should be conducted to evaluate parallel tests 
generated by the computer (Millman, 1980). 
Interpretation guidelines for parents concerning test 
results obtained from different forms of a test should 
also be considered. 

(d) There is a need to develop software which can construct 
reading comprehension tests based on textbooks (Roid, 
1984a). 

Item Banking 

For an item bank to be used widely, public agencies and 
textbook publishers must be involved. Development of item banks 
in the area of criterion-referenced achievement tests should be 
encouraged. Item banks developed and distributed with textbooks 
by publishers are becoming popular (e.g., PRISM, 
developed by the Psychological Corporation (1982), and the Academic 
Institutional Measurement System (AIMS), distributed by Charles 
E. Merrill Publishing Co.). But we believe good general purpose 
item banking systems should also have good potential. General 
purpose here means that all users can store their own items and 
that the item classification scheme is general enough for most 
purposes. 

There is a possibility that general data base programs will 
be used more often in developing item banking programs (Deck & 
Estes, 1984). Since more computer knowledge is required in 
using a general purpose data base, this prediction may hold 
for professional test developers rather than for ordinary test 
users. 

Another approach which may be considered is to develop item 
form banks. Instead of storing items, item forms (or other item 
generation techniques) may be stored. Items will be generated 
by the computer when a specific form is selected. This approach 
combines item generation with item banking. It should eliminate 
the concern about storage space for items. 
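
The item form bank idea can be illustrated with a minimal sketch. 
It is not taken from any system cited in this report: the ItemForm 
structure, the form identifier "ADD-2DIGIT", the replacement sets, 
and the seed are all hypothetical, chosen only to show that the 
bank stores a generation rule while concrete items are produced 
on request. 

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ItemForm:
    """A stored item form (hypothetical structure, for illustration only)."""
    form_id: str
    stem: str                                # stem text with {placeholders}
    replacement_sets: Dict[str, List[int]]   # allowed values per placeholder
    key: Callable[..., int]                  # computes the correct answer

    def generate(self, rng: random.Random) -> dict:
        """Draw one value per placeholder and return a concrete item."""
        values = {name: rng.choice(options)
                  for name, options in self.replacement_sets.items()}
        return {"form_id": self.form_id,
                "stem": self.stem.format(**values),
                "answer": self.key(**values)}


# The bank holds item forms, not items, so storage does not grow
# with the number of items that can be generated.
form_bank = {
    "ADD-2DIGIT": ItemForm(
        form_id="ADD-2DIGIT",
        stem="What is {a} + {b}?",
        replacement_sets={"a": list(range(10, 100)),
                          "b": list(range(10, 100))},
        key=lambda a, b: a + b,
    ),
}

# A seeded generator lets the same test form be reproduced later.
rng = random.Random(1985)
items = [form_bank["ADD-2DIGIT"].generate(rng) for _ in range(3)]
for item in items:
    print(item["stem"], "->", item["answer"])
```

In this sketch one stored form stands for 90 x 90 distinct addition 
items. Whether items generated from the same form are in fact 
parallel is an empirical question that the sketch does not address. 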

Test Design 

Current test design using IRT depends on mainframe computers 
to calibrate the items first. Then, the estimates of item 
parameters are transferred to microcomputers. With the 
development of the second generation of microcomputers, test 
design using IRT may not require mainframe computers any more. 
MicroCAT (Assessment Systems Corporation, 1985) has already 
incorporated such a capability into its system. If items can 
be calibrated and the test can be designed in one system, IRT 
may be used to develop tests by teachers who know very little 
about measurement theories. 

Another factor that will affect future test design is the 
new development in the area of multivariate methods. These 
methods make it possible to study the quality of tests involving 
multiple dimensions (Roid, 1984b). 

Test Administration 

The development of test administration using computers 
depends on the development of good item banks. In order to 
speed up the process of developing good banks, textbook 
publishers and public agencies may have to be involved in the 
development of testing systems. With the increasing 
availability and capability of microcomputers, powerful item 
selection strategies, such as Bayesian and maximum likelihood, 
may be implemented and used in test administration on a large 
scale. This is feasible because of the availability of 
microcomputers and developments in the area of microcomputer 
networking. 
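
To make the maximum likelihood strategy concrete, here is a 
minimal sketch of one adaptive testing loop under the 
two-parameter logistic model: after each response the ability 
estimate is updated by maximum likelihood, and the unused item 
with the greatest information at that estimate is given next. 
The five-item bank, the grid search bounded at -4 and 4, and the 
simulated examinee are assumptions made for illustration. A 
Bayesian strategy would replace the likelihood with a posterior 
distribution and is not shown. 

```python
import math

D = 1.7  # conventional logistic scaling constant


def p_correct(a, b, theta):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def information(a, b, theta):
    """Item information under the two-parameter model."""
    p = p_correct(a, b, theta)
    return (D * a) ** 2 * p * (1.0 - p)


def ml_estimate(responses, bank):
    """Grid-search maximum likelihood ability estimate on [-4, 4].

    The bounded grid keeps the estimate finite when all responses
    so far are correct (or all incorrect).
    """
    def loglik(theta):
        total = 0.0
        for item_id, u in responses.items():
            a, b = bank[item_id]
            p = p_correct(a, b, theta)
            total += math.log(p if u == 1 else 1.0 - p)
        return total

    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=loglik)


def next_item(theta_hat, bank, administered):
    """Choose the unused item with maximum information at theta_hat."""
    candidates = [i for i in bank if i not in administered]
    return max(candidates,
               key=lambda i: information(bank[i][0], bank[i][1], theta_hat))


def run_adaptive_test(bank, answer, n_items, start=0.0):
    """Administer n_items adaptively; `answer` maps item_id to 0 or 1."""
    theta_hat, responses = start, {}
    for _ in range(n_items):
        item_id = next_item(theta_hat, bank, responses)
        responses[item_id] = answer(item_id)
        theta_hat = ml_estimate(responses, bank)
    return theta_hat, list(responses)


# Hypothetical bank: item_id -> (discrimination, difficulty).
bank = {"Q1": (1.0, -2.0), "Q2": (1.0, -1.0), "Q3": (1.0, 0.0),
        "Q4": (1.0, 1.0), "Q5": (1.0, 2.0)}
# Hypothetical examinee: correct whenever item difficulty is below 0.5.
theta_hat, order = run_adaptive_test(bank, lambda i: int(bank[i][1] < 0.5),
                                     n_items=3)
print(order, round(theta_hat, 2))
```

The repeated likelihood evaluation after every response is the 
kind of computation that once called for a larger machine. 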

Further developments of psychometric theories in the areas 
of achievement and diagnostic testing are needed (McArthur & 
Choppin, 1984). These theories are needed to guide the 
anticipated popularity of computerized achievement and 
diagnostic testing. Also, there is a need to develop item 
analysis procedures based on data obtained from individualized 
testing. It seems illogical to use data obtained from group 
administered testing to estimate item parameters which are going 
to be used for individualized testing. 

Implications for Educational Testing 

The impact of the development of computer-assisted test 
construction on testing is most likely to be felt in the 
following directions: 

(a) Practicing computerized adaptive and diagnostic testing 
in classrooms, both in aptitude and achievement areas. 

(b) Applying IRT in test design by non-measurement 
specialists. 

(c) Using new item types and/or new assessment procedures 
in classrooms. 

(d) Evaluating and using items distributed by textbook 
publishers rather than teachers writing their own 
items. 

(e) Increasing popularity of computerized certification and 
licensing testing. 

APPENDIX A 
SAMPLE OF A USER'S EVALUATION FORM 
(Adapted from Ju, 1984) 

Directions: To what degree would you agree or disagree with the following 
statements? 

Response options for each statement: Strongly Agree, Agree, Disagree, 
Strongly Disagree, Don't Know. 

1. The package is useful for classroom testing. 

2. Using the package is fun. 

3. Using the package is frightening. 

4. The package is "user-sensitive" or "user friendly". 

5. Instructions to run the package are ambiguous and difficult to 
follow. 

6. Using the package is boring. 

7. The package runs smoothly. 

8. It is hard to get back to the menu if the user makes a mistake. 

9. Most important classroom testing activities are contained in the 
package. 

10. The user can modify the program to fit individual needs. 

11. The package allows users to repeat programs as often as they want. 

12. The package includes too many programs dealing with trivial 
classroom activities. 

13. The package adapts well to the user's requirements. 

14. Complex computer skills are required to run the package. 

15. Minimum training is needed to use the package. 

16. Documentation is confusing and inconsistent with the package. 

17. Adequate documents are provided for running the package. 

18. All possible responses are anticipated to make the package's 
operation predictable and reliable. 

19. Incorrect inputs are detected by the package. 

20. Input alternatives are flexible. 

21. Output alternatives are flexible. 

22. Screen display is clear and easy to read. 

23. Output summaries are difficult to interpret. 

24. The package has many uncorrected "bugs" which cause it to behave 
inconsistently or to "crash". 

25. Feedback is ineffective and inappropriate. 

26. Overall the response time (the time lag between your request and 
the response by the computer) is reasonable. 

27. Error messages are confusing. 

28. Adequate procedures are incorporated to prevent the user's errors. 

29. The package achieves its intent. 

30. This is an excellent package; recommend without hesitation. 